ETAPS Foreword


Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital 
of Bavaria, in Germany. 

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and 
Practice of Software. ETAPS is an annual federated conference established in 1998, 
and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from theo- 
retical computer science to foundations of programming languages, analysis tools, and 
formal approaches to software engineering. Organizing these conferences in a coherent, 
highly synchronized conference program enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the
weekend before the main conference, numerous satellite workshops took place that
attracted many researchers from all over the globe.

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, 
yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University 
College London, UK, and Cornell University, USA) and Tomas Vojnar (Brno 
University of Technology, Czech Republic) and the conference-specific invited 
speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck 
(University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by 
Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and 
Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated 
learning. 

As this event was the 25th edition of ETAPS, part of the program was a special 
celebration where we looked back on the achievements of ETAPS and its constituting 
conferences in the past, but we also looked into the future, and discussed the challenges 
ahead for research in software science. This edition also reinstated the ETAPS men- 
toring workshop for PhD students. 

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the 
Technical University of Munich (TUM) and the LMU Munich. The former was 
founded in 1868, and the latter in 1472 as the 6th oldest German university still running 
today. Together, they have 100,000 enrolled students, regularly rank among the top 
100 universities worldwide (with TUM’s computer-science department ranked #1 in 
the European Union), and their researchers and alumni include 60 Nobel laureates. 

The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer
(general, financial, and workshop chair), Julia Eisentraut (organization chair), and 
Alexandros Evangelidis (local proceedings chair). 

ETAPS 2022 was further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology). 

The ETAPS Steering Committee consists of an Executive Board, and representa- 
tives of the individual ETAPS conferences, as well as representatives of EATCS, 
EAPLS, and EASST. The Executive Board consists of Holger Hermanns 
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofron (Prague), Barbara König 
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik 
and Tallinn), and Lenore Zuck (Chicago). 

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch 
Johnsen (Oslo), Dana Fisman (Be’er Sheva), Reiko Heckel (Leicester), Joost-Pieter 
Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna 
Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), 
Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Rosu (Illinois), 
Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella 
(Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina 
(Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastian Uchitel 
(London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), 
Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz). 

I'd like to take this opportunity to thank all authors, attendees, organizers of the
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all 
enjoyed ETAPS 2022. 

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their 
enormous efforts to make ETAPS a fantastic event. 


February 2022 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


This volume contains the papers accepted at the 31st European Symposium on 
Programming (ESOP 2022), held during April 5-7, 2022, in Munich, Germany 
(COVID-19 permitting). ESOP is one of the European Joint Conferences on Theory 
and Practice of Software (ETAPS); it is dedicated to fundamental issues in the spec- 
ification, design, analysis, and implementation of programming languages and systems. 

The 21 papers in this volume were selected by the Program Committee (PC) from 
64 submissions. Each submission received between three and four reviews. After 
receiving the initial reviews, the authors had a chance to respond to questions and 
clarify misunderstandings of the reviewers. After the author response period, the papers 
were discussed electronically using the HotCRP system by the 33 Program Committee 
members and 33 external reviewers. Two papers, for which the PC chair had a conflict 
of interest, were kindly managed by Zena Ariola. The reviewing for ESOP 2022 was 
double-anonymous, and only authors of the eventually accepted papers have been 
revealed. 

Following the example set by other major conferences in programming languages, 
for the first time in its history, ESOP featured optional artifact evaluation. Authors 
of the accepted manuscripts were invited to submit artifacts, such as code, datasets, and 
mechanized proofs, that supported the conclusions of their papers. Members of the 
Artifact Evaluation Committee (AEC) read the papers and explored the artifacts, 
assessing their quality and checking that they supported the authors’ claims. The 
authors of eleven of the accepted papers submitted artifacts, which were evaluated by 
20 AEC members, with each artifact receiving four reviews. Authors of papers with 
accepted artifacts were assigned official EAPLS artifact evaluation badges, indicating 
that they have taken the extra time and have undergone the extra scrutiny to prepare a 
useful artifact. The ESOP 2022 AEC awarded Artifacts Functional and Artifacts 
(Functional and) Reusable badges. All submitted artifacts were deemed Functional, and 
all but one were found to be Reusable. 

My sincere thanks go to all who contributed to the success of the conference and to 
its exciting program. This includes the authors who submitted papers for consideration; 
the external reviewers who provided timely expert reviews sometimes on very short 
notice; the AEC members and chairs who took great care of this new aspect of ESOP; 
and, of course, the members of the ESOP 2022 Program Committee. I was extremely 
impressed by the excellent quality of the reviews, the amount of constructive feedback 
given to the authors, and the criticism delivered in a professional and friendly tone. 
I am very grateful to Andreea Costea and KC Sivaramakrishnan who kindly agreed to 
serve as co-chairs for the ESOP 2022 Artifact Evaluation Committee. I would like to 
thank the ESOP 2021 chair Nobuko Yoshida for her advice, patience, and the many 
insightful discussions on the process of running the conference. I thank all who con- 
tributed to the organization of ESOP: the ESOP steering committee and its chair Peter 
Thiemann, as well as the ETAPS steering committee and its chair Marieke Huisman. 
Finally, I would like to thank Barbara König and Alexandros Evangelidis for their help
with assembling the proceedings. 


Categorical Foundations of Gradient-Based Learning


Geoffrey S. H. Cruttwell¹, Bruno Gavranović², Neil Ghani², Paul Wilson³,
and Fabio Zanasi³


¹ Mount Allison University, Canada
² University of Strathclyde, United Kingdom
³ University College London

Abstract. We propose a categorical semantics of gradient-based ma- 
chine learning algorithms in terms of lenses, parametric maps, and re- 
verse derivative categories. This foundation provides a powerful explana- 
tory and unifying framework: it encompasses a variety of gradient descent 
algorithms such as ADAM, AdaGrad, and Nesterov momentum, as well 
as a variety of loss functions such as MSE and Softmax cross-entropy, 
shedding new light on their similarities and differences. Our approach to 
gradient-based learning has examples generalising beyond the familiar 
continuous domains (modelled in categories of smooth maps) and can 
be realized in the discrete setting of boolean circuits. Finally, we demon- 
strate the practical significance of our framework with an implementation 
in Python. 


1 Introduction 


The last decade has witnessed a surge of interest in machine learning, fuelled by 
the numerous successes and applications that these methodologies have found in 
many fields of science and technology. As machine learning techniques become 
increasingly pervasive, algorithms and models become more sophisticated, posing 
a significant challenge both to the software developers and the users that need to 
interface, execute and maintain these systems. In spite of this rapidly evolving 
picture, the formal analysis of many learning algorithms mostly takes place at a 
heuristic level [41], or using definitions that fail to provide a general and scalable
framework for describing machine learning. Indeed, it is commonly acknowledged 
through academia, industry, policy makers and funding agencies that there is a 
pressing need for a unifying perspective, which can make this growing body of 
work more systematic, rigorous, transparent and accessible both for users and 
developers ple]. 

Consider, for example, one of the most common machine learning scenar- 
ios: supervised learning with a neural network. This technique trains the model 
towards a certain task, e.g. the recognition of patterns in a data set (cf. Figure 1).
There are several different ways of implementing this scenario. Typically,
at their core, there is a gradient update algorithm (often called the “optimiser”),
depending on a given loss function, which updates in steps the parameters of the 
network, based on some learning rate controlling the “scaling” of the update. All 
of these components can vary independently in a supervised learning algorithm
and a number of choices are available for loss maps (quadratic error, Softmax
cross entropy, dot product, etc.) and optimisers (Adagrad [20], Momentum [87],
and Adam [32], etc.).


[Figure 1 diagram, with components labelled Input, Parameters, Neural Network,
Prediction (0.7 Cat + 0.2 Dog + 0.1 Horse), Labels, Loss Map, and Learning Rate.]


Fig. 1: An informal illustration of gradient-based learning. This neural network 
is trained to distinguish different kinds of animals in the input image. Given an 
input X, the network predicts an output Y, which is compared by a ‘loss map’ 
with what would be the correct answer (‘label’). The loss map returns a real 
value expressing the error of the prediction; this information, together with the 
learning rate (a weight controlling how much the model should be changed in 
response to error) is used by an optimiser, which computes by gradient-descent 
the update of the parameters of the network, with the aim of improving its 
accuracy. The neural network, the loss map, the optimiser and the learning rate 
are all components of a supervised learning system, and can vary independently 
of one another. 


This scenario highlights several questions: is there a uniform mathemati- 
cal language capturing the different components of the learning process? Can 
we develop a unifying picture of the various optimisation techniques, allowing 
for their comparative analysis? Moreover, it should be noted that supervised 
learning is not limited to neural networks. For example, supervised learning is 
surprisingly applicable to the discrete setting of boolean circuits where con- 
tinuous functions are replaced by boolean-valued functions. Can we identify an 
abstract perspective encompassing both the real-valued and the boolean case? 
In a nutshell, this paper seeks to answer the question: 


what are the fundamental mathematical structures underpinning gradient- 
based learning? 


Our approach to this question stems from the identification of three funda- 
mental aspects of the gradient-descent learning process: 
(I) computation is parametric, e.g. in the simplest case we are given a function
$f : P \times X \to Y$ and learning consists of finding a parameter $p : P$ such
that $f(p, -)$ is the best function according to some criteria. Specifically, the
weights on the internal nodes of a neural network are a parameter which the
learning is seeking to optimize. Parameters also arise elsewhere, e.g. in the
loss function (see later).

(II) information flows bidirectionally: in the forward direction, the computation
turns inputs via a sequence of layers into predicted outputs, and then
into a loss value; in the reverse direction, backpropagation is used to propagate
the changes backwards through the layers, and then turn them into
parameter updates.

(III) the basis of parameter update via gradient descent is differentiation, e.g.
in the simple case we differentiate the function mapping a parameter to its
associated loss to reduce that loss.
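
The three aspects above can be seen in miniature in a hand-written training loop. The following is only an illustrative sketch under assumptions of our own choosing: a one-parameter linear model, a quadratic loss, and the hypothetical names `model`, `loss`, `grad_wrt_p` and `train`, none of which are taken from the implementation discussed in this paper.

```python
# Aspect (I): a parametric computation f(p, x); here a one-parameter linear model.
def model(p, x):
    return p * x


# The forward pass continues into a loss value (quadratic error against a label).
def loss(y_pred, y_true):
    return 0.5 * (y_pred - y_true) ** 2


# Aspects (II) and (III): the backward pass, written by hand via the chain rule.
def grad_wrt_p(p, x, y_true):
    d_loss_d_pred = model(p, x) - y_true   # derivative of the loss in its first argument
    d_pred_d_p = x                         # derivative of the model in the parameter
    return d_loss_d_pred * d_pred_d_p


def train(p, data, learning_rate=0.1, epochs=50):
    """Basic gradient descent: repeatedly move p against the gradient of the loss."""
    for _ in range(epochs):
        for x, y in data:
            p = p - learning_rate * grad_wrt_p(p, x, y)
    return p


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of the target function y = 2x
p_final = train(0.0, data)
```

On this toy data the learned parameter approaches 2, the slope of the target function.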


We model bidirectionality via lenses [6][12][29] and based upon the above
three insights, we propose the notion of parametric lens as the fundamental 
semantic structure of learning. In a nutshell, a parametric lens is a process with 
three kinds of interfaces: inputs, outputs, and parameters. On each interface, 
information flows both ways, i.e. computations are bidirectional. These data 
are best explained with our graphical representation of parametric lenses, with 
inputs A, A’, outputs B, B’, parameters P, P’, and arrows indicating information 
flow (below left). The graphical notation also makes evident that parametric 
lenses are open systems, which may be composed along their interfaces (below 
center and right). 


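[Diagram (1): left, a parametric lens with inputs A, A', outputs B, B', and
parameters P, P'; center, the composition of two parametric lenses from (A, A')
to (C, C'); right, a reparameterisation of a parametric lens from (A, A') to (B, B').]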


This pictorial formalism is not just an intuitive sketch: as we will show, it can 
be understood as a completely formal (graphical) syntax using the formalism of 
string diagrams [89], in a way similar to how other computational phenomena 
have been recently analysed e.g. in quantum theory [14], control theory [lB],
and digital circuit theory [26]. 

It is intuitively clear how parametric lenses express aspects (I) and (II) above, 
whereas (III) will be achieved by studying them in a space of ‘differentiable 
objects’ (in a sense that will be made precise). The main technical contribution 
of our paper is showing how the various ingredients involved in learning (the 
model, the optimiser, the error map and the learning rate) can be uniformly 
understood as being built from parametric lenses. 

We will use category theory as the formal language to develop our notion of
parametric lenses, and make Figure 2 mathematically precise. The categorical
perspective brings several advantages, which are well-known, established principles
in programming language semantics [3][40][49]. Three of them are particularly

[Figure 2 diagram: lenses labelled Model, Loss, and Learning, with wires A and P.]

Fig. 2: The parametric lens that captures the learning process informally sketched
in Figure 1. Note each component is a lens itself, whose composition yields the
interactions described in Figure 1. Defining this picture formally will be the
subject of the later sections.


important to our contribution, as they constitute distinctive advantages of our 
semantic foundations: 


Abstraction Our approach studies which categorical structures are sufficient 
to perform gradient-based learning. This analysis abstracts away from the 
standard case of neural networks in several different ways: as we will see, it 
encompasses other models (namely Boolean circuits), different kinds of op- 
timisers (including Adagrad, Adam, Nesterov momentum), and error maps 
(including quadratic and softmax cross entropy loss). These can be all un- 
derstood as parametric lenses, and different forms of learning result from 
their interaction. 

Uniformity As seen in Figure 1, learning involves ingredients that are seemingly
quite different: a model, an optimiser, a loss map, etc. We will show
how all these notions may be seen as instances of the categorical defini- 
tion of a parametric lens, thus yielding a remarkably uniform description of 
the learning process, and supporting our claim of parametric lenses being a 
fundamental semantic structure of learning. 

Compositionality The use of categorical structures to describe computation 
naturally enables compositional reasoning whereby complex systems are anal- 
ysed in terms of smaller, and hence easier to understand, components. Com- 
positionality is a fundamental tenet of programming language semantics; in 
the last few years, it has found application in the study of diverse kinds of 
computational models, across different fields; see e.g. [salsas]. As made
evident by Figure 2, our approach models a neural network as a parametric
lens, resulting from the composition of simpler parametric lenses, capturing 
the different ingredients involved in the learning process. Moreover, as all 
the simpler parametric lenses are themselves composable, one may engineer 
a different learning process by simply plugging a new lens on the left or right 
of existing ones. This means that one can glue together smaller and relatively 
simple networks to create larger and more sophisticated neural networks. 

We now give a synopsis of our contributions:


– In Section 2 we introduce the tools necessary to define our notion of parametric
lens. First, in Section 2.1, we introduce a notion of parametric categories,
which amounts to a functor Para(−) turning a category C into one
Para(C) of ‘parametric C-maps’. Second, we recall lenses (Section 2.2). In a
nutshell, a lens is a categorical morphism equipped with operations to view
and update values in a certain data structure. Lenses play a prominent role
in functional programming [47], as well as in the foundations of database
theory and more recently game theory [25]. Considering lenses in C simply
amounts to the application of a functorial construction Lens(−), yielding
Lens(C). Finally, we recall the notion of a cartesian reverse differential
category (CRDC): a categorical structure axiomatising the notion of
differentiation (Section 2.4). We wrap up in Section 2.3 by combining these
ingredients into the notion of parametric lens, formally defined as a morphism
in Para(Lens(C)) for a CRDC C. In terms of our desiderata (I)-(III) above,
note that Para(−) accounts for (I), Lens(−) accounts for (II), and the CRDC
structure accounts for (III).

– As seen in Figure 1, in the learning process there are many components at
work: the model, the optimiser, the loss map, the learning rate, etc. In Section 3
we show how the notion of parametric lens provides a uniform characterisation
for such components. Moreover, for each of them, we show how
different variations appearing in the literature become instances of our abstract
characterisation. The plan is as follows:

◦ In Section 3.1 we show how the combinatorial model subject of the training
can be seen as a parametric lens. The conditions we provide are met by the
‘standard’ case of neural networks, but also enable the study of learning for
other classes of models. In particular, another instance is Boolean circuits:
learning of these structures is relevant to binarisation and it has been
explored recently using a categorical approach [50], which turns out to be
a particular case of our framework.

◦ In Section 3.2 we show how the loss maps associated with training are also
parametric lenses. Our approach covers the cases of quadratic error, Boolean
error, Softmax cross entropy, but also the ‘dot product loss’ associated with
the phenomenon of deep dreaming [19][34][35][44].

◦ In Section 3.3 we model the learning rate as a parametric lens. This
analysis also allows us to contrast how learning rate is handled in the ‘real-valued’
case of neural networks with respect to the ‘Boolean-valued’ case of
Boolean circuits.

◦ In Section 3.4 we show how optimisers can be modelled as ‘reparameterisations’
of models as parametric lenses. As case studies, in addition to
basic gradient update, we consider the stateful variants: Momentum [87],
Nesterov Momentum [48], Adagrad [20], and Adam (Adaptive Moment Estimation) [32].
Also, on Boolean circuits, we show how the reverse derivative
ascent of can be also regarded in such way. 

– In Section 4 we study how the composition of the lenses defined in Section 3
yields a description of different kinds of learning processes.

◦ Section 4.1 is dedicated to modelling supervised learning of parameters,
in the way described in Figure 1. This amounts essentially to the study of
the composite of lenses expressed in Figure 2, for different choices of the
various components. In particular we look at (i) quadratic loss with basic
gradient descent, (ii) softmax cross entropy loss with basic gradient descent,
(iii) quadratic loss with Nesterov momentum, and (iv) learning in Boolean
circuits with XOR loss and basic gradient ascent.

◦ In order to showcase the flexibility of our approach, in Section 4.2 we depart
from our ‘core’ case study of parameter learning, and turn attention
to supervised learning of inputs, also called deep dreaming: the idea
behind this technique is that, instead of the network parameters, one updates
the inputs, in order to elicit a particular interpretation [19][34][35][44].
Deep dreaming can be easily expressed within our approach, with a different
rearrangement of the parametric lenses involved in the learning process,
see below. The abstract viewpoint of categorical semantics provides a
mathematically precise and visually captivating description of the differences
between the usual parameter learning process and deep dreaming.

– In Section 5 we describe a proof-of-concept Python implementation, available
at [17], based on the theory developed in this paper. This code is intended
to show more concretely the payoff of our approach. Model architectures, as
well as the various components participating in the learning process, are now
expressed in a uniform, principled mathematical language, in terms of lenses.
As a result, computing network gradients is greatly simplified, as it amounts
to lens composition. Moreover, the modularity of this approach allows one to
more easily tune the various parameters of training.
We show our library via a number of experiments, and demonstrate correctness by
achieving accuracy on par with an equivalent model in Keras, a mainstream
deep learning framework E]. In particular, we create a working non-trivial
neural network model for the MNIST image-classification problem [33].

– Finally, in Sections 6 and 7 we discuss related and future work.


2 Categorical Toolkit 


In this section we describe the three categorical components of our framework,
each corresponding to an aspect of gradient-based learning: (I) the Para
construction (Section 2.1), which builds a category of parametric maps, (II) the
Lens construction, which builds a category of “bidirectional” maps (Section
2.2), and (III) the combination of these two constructions into the notion of
“parametric lenses” (Section 2.3). Finally (IV) we recall Cartesian reverse
differential categories, i.e. categories equipped with an abstract gradient operator.


Notation We shall use $f ; g$ for sequential composition of morphisms $f : A \to B$
and $g : B \to C$ in a category, $1_A$ for the identity morphism on A, and $I$ for the
unit object of a symmetric monoidal category.

2.1 Parametric Maps


In supervised learning one is typically interested in approximating a function
$g : \mathbb{R}^n \to \mathbb{R}^m$ for some n and m. To do this, one begins by building a neural
network, which is a smooth map $f : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$ where $\mathbb{R}^p$ is the set of
possible weights of that neural network. Then one looks for a value of $q \in \mathbb{R}^p$
such that the function $f(q, -) : \mathbb{R}^n \to \mathbb{R}^m$ closely approximates g. We formalise
these maps categorically via the Para construction [9][23][24][30].


Definition 1 (Parametric category). Let $(C, \otimes, I)$ be a strict* symmetric
monoidal category. We define a category Para(C) with objects those of C, and
a map from A to B a pair $(P, f)$, with P an object of C and $f : P \otimes A \to B$.
The composite of maps $(P, f) : A \to B$ and $(P', f') : B \to C$ is the pair
$(P' \otimes P, (1_{P'} \otimes f); f')$. The identity on A is the pair $(I, 1_A)$.


Example 1. Take the category Smooth whose objects are natural numbers and
whose morphisms $f : n \to m$ are smooth maps from $\mathbb{R}^n$ to $\mathbb{R}^m$. As described
above, the category Para(Smooth) can be thought of as a category of neural
networks: a map in this category from n to m consists of a choice of p and a
map $f : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$ with $\mathbb{R}^p$ representing the set of possible weights of the
neural network.
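
To make the composition rule of Definition 1 concrete, the following minimal sketch models a parametric map as an ordinary two-argument Python function. It is an illustration only: the class `Para`, the method `then`, and the example layers are hypothetical names introduced here, the monoidal product is modelled by nested tuples and the unit object by the empty tuple, so composition is associative only up to re-bracketing of parameters (unlike the strict setting assumed in the definition).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Para:
    """A parametric map (P, f): A -> B, stored as a function f(p, a) -> b."""
    f: Callable[[Any, Any], Any]

    def then(self, other: "Para") -> "Para":
        """Composite of self = (P, f) and other = (P', f'), with parameter P' (x) P."""
        def composite(params, a):
            p_other, p_self = params          # an element of P' (x) P
            # (1_{P'} (x) f) ; f' : first run f, then feed the result to f'
            return other.f(p_other, self.f(p_self, a))
        return Para(composite)


def identity() -> Para:
    """Identity (I, 1_A); the unit object I is modelled by the empty tuple."""
    return Para(lambda _unit, a: a)


# Two small "layers" on real numbers (hypothetical examples).
affine = Para(lambda p, x: p[0] * x + p[1])   # parameter p = (weight, bias)
scale = Para(lambda q, y: q * y)              # parameter q = a single factor

net = affine.then(scale)                      # parameter space: (q, (weight, bias))
output = net.f((3.0, (2.0, 1.0)), 5.0)        # computes 3 * (2 * 5 + 1)
```

Note how the parameter of the composite is a pair collecting the parameters of both layers, which is what allows a network to be assembled layer by layer.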


As we will see in the next sections, the interplay of the various components
at work in the learning process becomes much clearer once the morphisms
of Para(C) are represented using the pictorial formalism of string diagrams,
which we now recall. In fact, we will mildly massage the traditional notation
for string diagrams (below left), by representing a morphism $(P, f) : A \to B$
in Para(C) as below right.


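[Diagram: left, the traditional string diagram of f with input wires P and A and
output wire B; right, the same morphism drawn as a box f from A to B with P
drawn as a separate vertical wire.]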


This is to emphasise the special role played by P, reflecting the fact that in
machine learning data and parameters have different semantics. String diagrammatic
notation also allows us to neatly represent composition of maps $(P, f) : A \to B$
and $(P', f') : B \to C$ (below left), and “reparameterisation” of $(P, f) : A \to B$
by a map $\alpha : Q \to P$ (below right), yielding a new map $(Q, (\alpha \otimes 1_A); f) : A \to B$.


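[Diagram (2): left, the composite of (P, f) and (P', f'), drawn as boxes f and f'
in sequence from A to C with parameter wires P and P'; right, the
reparameterisation of f : A → B, with α : Q → P attached to its parameter wire.]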


* One can also define Para(C) in the case when C is non-strict; however, the result
would not be a category but a bicategory.

Intuitively, reparameterisation changes the parameter space of $(P, f) : A \to B$ to
some other object Q, via some map $\alpha : Q \to P$. We shall see later that gradient
descent and its many variants can naturally be viewed as reparameterisations.
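
Reparameterisation is just precomposition on the parameter argument. The sketch below illustrates this with plain Python functions; the helper `reparameterise` and the choice of the exponential as the reparameterising map are assumptions made for the sake of the example, not constructions from this paper.

```python
import math


def reparameterise(f, alpha):
    """Given f : P x A -> B and alpha : Q -> P, return (alpha (x) 1_A) ; f : Q x A -> B."""
    def g(q, a):
        return f(alpha(q), a)   # alpha acts on the parameter, the input a is untouched
    return g


def scale(p, a):
    """A parametric map with a single real parameter p."""
    return p * a


# Hypothetical use: take Q = R and alpha = exp, so that the effective
# scaling factor exp(q) is always positive whatever the value of q.
positive_scale = reparameterise(scale, math.exp)
y = positive_scale(0.0, 4.0)    # exp(0) * 4
```

The original map is left untouched: only the space in which its parameter is chosen changes.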

Note that the coherence rules for combining the two operations in (2) work just as
expected, as these diagrams can be ultimately ‘compiled’ down to string diagrams
for monoidal categories.


2.2 Lenses 


In machine learning (or even learning in general) it is fundamental that infor- 
mation flows both forwards and backwards: the ‘forward’ flow corresponds to a 
model’s predictions, and the ‘backwards’ flow to corrections to the model. The 
category of lenses is the ideal setting to capture this type of structure, as it is a 
category consisting of maps with both a “forward” and a “backward” part. 


Definition 2. For any Cartesian category C, the category of (bimorphic) lenses
in C, Lens(C), is the category with the following data. Objects are pairs $(A, A')$
of objects in C. A map from $(A, A')$ to $(B, B')$ consists of a pair $(f, f^*)$ where
$f : A \to B$ (called the get or forward part of the lens) and $f^* : A \times B' \to A'$
(called the put or backwards part of the lens). The composite of $(f, f^*) :
(A, A') \to (B, B')$ and $(g, g^*) : (B, B') \to (C, C')$ is given by get $f ; g$ and put
$\langle \pi_0, \langle \pi_0 ; f, \pi_1 \rangle ; g^* \rangle ; f^*$. The identity on $(A, A')$ is the pair $(1_A, \pi_1)$.
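
The composition rule for the put part is easier to read when unfolded step by step. The following is a minimal sketch in Python, with lenses modelled as pairs of functions; the class name `Lens`, the method `then`, and the two example lenses are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Lens:
    get: Callable[[Any], Any]         # f  : A -> B
    put: Callable[[Any, Any], Any]    # f* : A x B' -> A'

    def then(self, other: "Lens") -> "Lens":
        """Composite of self = (f, f*) followed by other = (g, g*)."""
        def get(a):
            return other.get(self.get(a))        # f ; g

        def put(a, dc):
            b = self.get(a)                      # pi_0 ; f
            db = other.put(b, dc)                # <pi_0 ; f, pi_1> ; g*
            return self.put(a, db)               # <pi_0, ...> ; f*
        return Lens(get, put)


# Identity lens (1_A, pi_1): the backward part returns the incoming change.
identity = Lens(lambda a: a, lambda a, da: da)

# Example lenses whose put parts are the usual derivatives (chosen by hand).
square = Lens(lambda x: x * x, lambda x, dy: 2 * x * dy)
triple = Lens(lambda y: 3 * y, lambda y, dz: 3 * dz)

both = square.then(triple)            # forward part: x |-> 3 * x^2
value = both.get(2.0)
change = both.put(2.0, 1.0)           # chain rule: 2 * x * 3 at x = 2
```

With these particular choices the composite put part reproduces the chain rule, which anticipates the role lenses play for reverse derivatives.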


The embedding of Lens(C) into the category of Tambara modules over C
(see [7, Thm. 23]) provides a rich string diagrammatic language, in which lenses
may be represented with forward/backward wires indicating the information
flow. In this language, a morphism $(f, f^*) : (A, A') \to (B, B')$ is written as
below left, which can be ‘expanded’ as below right.


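[Diagram: left, the lens (f, f*) drawn as a single box with wires A, A' and B, B';
right, its expansion into the forward map f and the backward map f*.]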


It is clear in this language how to describe the composite of $(f, f^*) : (A, A') \to
(B, B')$ and $(g, g^*) : (B, B') \to (C, C')$:


[Diagram: the composite lens, drawn by joining the B and B' wires of (f, f*) and (g, g*).]

2.3 Parametric Lenses


The fundamental category where supervised learning takes place is the composite 
Para(Lens(C)) of the two constructions in the previous sections: 


Definition 3. The category Para(Lens(C)) of parametric lenses on C has
as objects pairs $(A, A')$ of objects from C. A morphism from $(A, A')$ to $(B, B')$,
called a parametric lens⁵, is a choice of parameter pair $(P, P')$ and a lens $(f, f^*) :
(P, P') \times (A, A') \to (B, B')$ so that $f : P \times A \to B$ and $f^* : P \times A \times B' \to P' \times A'$.


String diagrams for parametric lenses are built by simply composing the graphical
languages of the previous two sections; see (1), where respectively a morphism,
a composition of morphisms, and a reparameterisation are depicted.

Given a generic morphism in Para(Lens(C)) as depicted in (1) on the left,
one can see how it is possible to “learn” new values from f: it takes as input an
input A, a parameter P, and a change B', and outputs a change in A, a value
of B, and a change P'. This last element is the key component for supervised
learning: intuitively, it says how to change the parameter values to get the neural
network closer to the true value of the desired function.

The question, then, is how one is to define such a parametric lens given
nothing more than a neural network, i.e., a parametric map $(P, f) : A \to B$.
This is precisely what the gradient operation provides, and its generalization to
categories is explored in the next subsection.
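
As an informal illustration of Definition 3 and of the interfaces just described, the sketch below combines the two previous constructions in Python. It is a sketch under stated assumptions rather than the implementation accompanying this paper: the class `ParaLens`, its fields `forward` and `backward`, and the scaling layer are hypothetical names, and parameters of a composite are modelled as a pair.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class ParaLens:
    forward: Callable[[Any, Any], Any]                     # f  : P x A -> B
    backward: Callable[[Any, Any, Any], Tuple[Any, Any]]   # f* : P x A x B' -> P' x A'

    def then(self, other: "ParaLens") -> "ParaLens":
        """Sequential composite; its parameter is the pair (p_other, p_self)."""
        def forward(params, a):
            p2, p1 = params
            return other.forward(p2, self.forward(p1, a))

        def backward(params, a, dc):
            p2, p1 = params
            b = self.forward(p1, a)                 # recompute the intermediate value
            dp2, db = other.backward(p2, b, dc)     # change for the second layer
            dp1, da = self.backward(p1, a, db)      # change for the first layer
            return (dp2, dp1), da
        return ParaLens(forward, backward)


# A scaling layer b = p * a, with its backward part written by hand:
# the change in p is a * db and the change in a is p * db.
scale = ParaLens(
    forward=lambda p, a: p * a,
    backward=lambda p, a, db: (a * db, p * db),
)

two_layers = scale.then(scale)
value = two_layers.forward((3.0, 2.0), 5.0)                    # 3 * (2 * 5)
(dp2, dp1), da = two_layers.backward((3.0, 2.0), 5.0, 1.0)
```

The backward part returns one change per parameter together with a change for the input, matching the three kinds of interface of a parametric lens.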


2.4 Cartesian Reverse Differential Categories 


Fundamental to all types of gradient-based learning is, of course, the gradient 
operation. In most cases this gradient operation is performed in the category of 
smooth maps between Euclidean spaces. However, recent work has shown 
that gradient-based learning can also work well in other categories; for example, 
in a category of boolean circuits. Thus, to encompass these examples in a single 
framework, we will work in a category with an abstract gradient operation. 


Definition 4. A Cartesian left additive category Defn. 1] consists of 
a category C with chosen finite products (including a terminal object), and an 
addition operation and zero morphism in each homset, satisfying various axioms. 
A Cartesian reverse differential category (CRDC) Defn. 13] consists 
of a Cartesian left additive category C, together with an operation which provides, 
for each map $f : A \to B$ in C, a map $R[f] : A \times B \to A$ satisfying various
axioms. 


For $f : A \to B$, the pair (f, R[f]) forms a lens from (A, A) to (B, B). We
will pursue the idea that R[f] acts as a backwards map, thus giving a means to
“learn” f.


5 In [23], these are called learners. However, in this paper we study them in a much 
broader light; see Section [6] 
Note that assigning type $A \times B \to A$ to R[f] hides some relevant information:
B-values in the domain and A-values in the codomain of R[f] do not play the
same role as values of the same types in $f : A \to B$. In R[f], they really take in a
tangent vector at B and output a tangent vector at A (cf. the definition of R[f]
in Smooth, Example 2 below). To emphasise this, we will type R[f] as a map
$A \times B' \to A'$ (even though in reality A = A' and B = B'), thus meaning that
(f, R[f]) is actually a lens from (A, A') to (B, B'). This typing distinction will
be helpful later on, when we want to add additional components to our learning
algorithms.

The following two examples of CRDCs will serve as the basis for the learning 
scenarios of the upcoming sections. 


Example 2. The category Smooth (Example 1) is Cartesian with product given
by addition, and it is also a Cartesian reverse differential category: given a
smooth map $f : \mathbb{R}^n \to \mathbb{R}^m$, the map $R[f] : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ sends a pair $(x, v)$
to $J[f]^T(x) \cdot v$: the transpose of the Jacobian of f at x in the direction v. For
example, if $f : \mathbb{R}^2 \to \mathbb{R}^3$ is defined as $f(x_1, x_2) := (x_1^3 + 2 x_1 x_2,\ x_2,\ \sin(x_1))$, then
$R[f] : \mathbb{R}^2 \times \mathbb{R}^3 \to \mathbb{R}^2$ is given by

$$(x, v) \mapsto \begin{bmatrix} 3x_1^2 + 2x_2 & 0 & \cos(x_1) \\ 2x_1 & 1 & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}.$$

Using the reverse derivative (as opposed to the forward derivative) is well-known to be
much more computationally efficient for functions $f : \mathbb{R}^n \to \mathbb{R}^m$ when $m \ll n$
(for example, see [28]), as is the case in most supervised learning situations
(where often m = 1).
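The worked example above can be checked numerically. The following is a minimal sketch, assuming the example map is f(x1, x2) = (x1^3 + 2 x1 x2, x2, sin(x1)); the function names are illustrative and the finite-difference helper is only a sanity check, not part of the framework.

```python
import math

def f(x1, x2):
    """The smooth map f : R^2 -> R^3 assumed from the example."""
    return (x1**3 + 2 * x1 * x2, x2, math.sin(x1))

def reverse_derivative_f(x, v):
    """R[f](x, v) = J[f]^T(x) . v, written out by hand."""
    x1, x2 = x
    v1, v2, v3 = v
    return ((3 * x1**2 + 2 * x2) * v1 + math.cos(x1) * v3,
            2 * x1 * v1 + v2)

def numeric_reverse_derivative(x, v, h=1e-6):
    """Central finite-difference approximation of J^T(x) . v (comparison only)."""
    out = []
    for i in range(2):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        # column i of the Jacobian: d f_j / d x_i for each output j
        col = [(a - b) / (2 * h) for a, b in zip(f(*plus), f(*minus))]
        out.append(sum(c * vj for c, vj in zip(col, v)))
    return tuple(out)

x, v = (1.0, 2.0), (1.0, 0.5, -1.0)
exact = reverse_derivative_f(x, v)
approx = numeric_reverse_derivative(x, v)
```

Note that the reverse derivative returns a vector of the size of the input (here 2), however many outputs f has.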


Example 3. Another CRDC is the symmetric monoidal category $\mathrm{POLY}_{\mathbb{Z}_2}$
Example 14] with objects the natural numbers and morphisms $f : A \to B$ the B-
tuples of polynomials in $\mathbb{Z}_2[x_1, \ldots, x_A]$. When presented by generators and relations
these morphisms can be viewed as a syntax for boolean circuits, with parametric
lenses for such circuits (and their reverse derivative) described in [50].


3 Components of learning as Parametric Lenses 


As seen in the introduction, in the learning process there are many components 
at work: a model, an optimiser, a loss map, a learning rate, etc. In this section 
we show how each such component can be understood as a parametric lens. 
Moreover, for each component, we show how our framework encompasses several 
variations of the gradient-descent algorithms, thus offering a unifying perspective 
on many different approaches that appear in the literature. 


3.1 Models as Parametric Lenses 


We begin by characterising the models used for training as parametric lenses. 
In essence, our approach identifies a set of abstract requirements necessary to 
perform training by gradient descent, which covers the case studies that we will 
consider in the next sections. 
The leading intuition is that a suitable model is a parametric map, equipped
with a reverse derivative operator. Using the formal developments above,
this amounts to assuming that a model is a morphism in Para(C), for a CRDC
C. In order to visualise such a morphism as a parametric lens, it then suffices to
apply under Para(−) the canonical morphism $R : C \to \mathrm{Lens}(C)$ (which exists
for any CRDC C, see Prop. 31]), mapping f to (f, R[f]). This yields a functor
$\mathrm{Para}(R) : \mathrm{Para}(C) \to \mathrm{Para}(\mathrm{Lens}(C))$, pictorially defined as


[String diagram (4): the parametric map f : P × A → B is sent to the parametric lens (f, R[f]) : (P, P') × (A, A') → (B, B').]


Example 4 (Neural networks). As noted previously, to learn a function of type
$\mathbb{R}^n \to \mathbb{R}^m$, one constructs a neural network, which can be seen as a function of
type $\mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$ where $\mathbb{R}^p$ is the space of parameters of the neural network.
As seen in Example 1, this is a map in the category Para(Smooth) of type
$\mathbb{R}^n \to \mathbb{R}^m$ with parameter space $\mathbb{R}^p$. Then one can apply the functor in (4)
to present a neural network together with its reverse derivative operator as a
parametric lens, i.e. a morphism in Para(Lens(Smooth)).


Example 5 (Boolean circuits). For learning of Boolean circuits as described in
[50], the recipe is the same as in Example 4, except that the base category is
$\mathrm{POLY}_{\mathbb{Z}_2}$ (see Example 3). The important observation here is that $\mathrm{POLY}_{\mathbb{Z}_2}$ is a
CRDC, see [13, 50], and thus we can apply the functor in (4).


Note that a model/parametric lens f takes as inputs an element of A, an
element of B' (a change in B) and a parameter P, and outputs an element of
B, a change in A, and a change in P. This is not yet sufficient to do machine
learning! When we perform learning, we want to input a parameter P and a pair
A × B and receive a new parameter P. Instead, f expects a change in B (not an
element of B) and outputs a change in P (not an element of P). Deep dreaming,
on the other hand, wants to return an element of A (not a change in A). Thus, to 
do machine learning (or deep dreaming) we need to add additional components 
to f; we will consider these additional components in the next sections. 


3.2 Loss Maps as Parametric Lenses 


Another key component of any learning algorithm is the choice of loss map. 
This gives a measurement of how far the current output of the model is from 
the desired output. In standard learning in Smooth, this loss map is viewed as 
a map of type B x B > R. However, in our setup, this is naturally viewed as a 
parametric map from B to R with parameter space B (see footnote 6). We also generalize the
codomain to an arbitrary object L.


Definition 5. A loss map on B consists of a parametric map (B, loss) : 
Para(C)(B, L) for some object L. 


Note that we can precompose a loss map $(B, \mathrm{loss}) : B \to L$ with a neural
network $(P, f) : A \to B$ (below left), and apply the functor in (4) (with C =
Smooth) to obtain the parametric lens below right.


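[String diagram (5): the composite of the model f and the loss map in Para(Smooth) (left), and the corresponding parametric lens built from (f, R[f]) and (loss, R[loss]), with parameter wires P/P' and B/B' and horizontal wires A/A' and L/L' (right).]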


This is getting closer to the parametric lens we want: it can now receive 
inputs of type B. However, this is at the cost of now needing an input to L’; we 
consider how to handle this in the next section. 


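Example 6 (Quadratic error). In Smooth, the standard loss function on $\mathbb{R}^b$ is
quadratic error: it uses $L = \mathbb{R}$ and has parametric map $e : \mathbb{R}^b \times \mathbb{R}^b \to \mathbb{R}$ given
by $e(b_t, b_p) = \frac{1}{2} \sum_{i=1}^{b} ((b_p)_i - (b_t)_i)^2$, where we think of $b_t$ as the “true” value and
$b_p$ the predicted value. This has reverse derivative $R[e] : \mathbb{R}^b \times \mathbb{R}^b \times \mathbb{R} \to \mathbb{R}^b \times \mathbb{R}^b$
given by $R[e](b_t, b_p, \alpha) = \alpha \cdot (b_t - b_p,\ b_p - b_t)$, where the first component is the
change in $b_t$ and the second the change in $b_p$; note that α suggests the idea of
learning rate, which we will explore in Section 3.3.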


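Example 7 (Boolean error). In $\mathrm{POLY}_{\mathbb{Z}_2}$, the loss function on $\mathbb{Z}^b$ that is
implicitly used is a bit different: it uses $L = \mathbb{Z}^b$ and has parametric map
$e : \mathbb{Z}^b \times \mathbb{Z}^b \to \mathbb{Z}^b$ given by

$$e(b_t, b_p) = b_t + b_p.$$

(Note that this is + in $\mathbb{Z}_2$; equivalently this is given by XOR.) Its reverse derivative
is of type $R[e] : \mathbb{Z}^b \times \mathbb{Z}^b \times \mathbb{Z}^b \to \mathbb{Z}^b \times \mathbb{Z}^b$ given by $R[e](b_t, b_p, \alpha) = (\alpha, \alpha)$.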
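The Boolean error and its reverse derivative are small enough to write out directly. This is a sketch under the assumption that elements of the loss object are represented as tuples of bits (0 or 1); the function names are illustrative.

```python
def xor_loss(b_t, b_p):
    """e(b_t, b_p) = b_t + b_p over Z_2, computed bitwise as XOR."""
    return tuple(t ^ p for t, p in zip(b_t, b_p))

def xor_loss_reverse(b_t, b_p, alpha):
    """R[e](b_t, b_p, alpha) = (alpha, alpha).

    Addition is linear in both arguments, so the incoming change alpha
    is simply copied to each of them, whatever b_t and b_p are.
    """
    return (tuple(alpha), tuple(alpha))

loss = xor_loss((1, 0, 1), (1, 1, 0))
changes = xor_loss_reverse((1, 0, 1), (1, 1, 0), loss)
```

A loss bit of 1 marks a position where prediction and ground truth disagree.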


Example 8 (Softmax cross entropy). The Softmax cross entropy loss is an $\mathbb{R}^b$-
parametric map $\mathbb{R}^b \to \mathbb{R}$ defined by

$$e(b_t, b_p) = \sum_{i=1}^{b} (b_t)_i \left( (b_p)_i - \log(\mathrm{Softmax}(b_p)_i) \right),$$

where

$$\mathrm{Softmax}(b_p)_i = \frac{\exp((b_p)_i)}{\sum_{j=1}^{b} \exp((b_p)_j)}$$

is defined componentwise for each class i.
We note that, although $b_t$ needs to be a probability distribution, at the
moment there is no need to ponder the question of interaction of probability
distributions with the reverse derivative framework: one can simply consider $b_t$
as the image of some logits under the Softmax function.


6 Here the loss map has its parameter space equal to its input space. However, putting 
loss maps on the same footing as models lends itself to further generalizations where 
the parameter space is different, and where the loss map can itself be learned. See 
Generative Adversarial Networks, p Figure 7.]. 




Example 9 (Dot product). In Deep Dreaming (Section 4.2) we often want to focus
only on a particular element of the network output $\mathbb{R}^b$. This is done by supplying
a one-hot vector $b_t$ as the ground truth to the loss function $e(b_t, b_p) = b_t \cdot b_p$, which
computes the dot product of two vectors. If the ground truth vector $b_t$ is a one-
hot vector (active at the i-th element), then the dot product performs masking of
all inputs except the i-th one. Note the reverse derivative $R[e] : \mathbb{R}^b \times \mathbb{R}^b \times \mathbb{R} \to
\mathbb{R}^b \times \mathbb{R}^b$ of the dot product is defined as $R[e](b_t, b_p, \alpha) = (\alpha \cdot b_p, \alpha \cdot b_t)$.


3.3 Learning Rates as Parametric Lenses 


After models and loss maps, another ingredient of the learning process are learn- 
ing rates, which we formalise as follows. 


Definition 6. A learning rate α on L consists of a lens from (L, L') to (1, 1)
where 1 is a terminal object in C.


Note that the get component of the learning rate lens must be the unique map
to 1, while the put component is a map $L \times 1 \to L'$; that is, simply a map
$\alpha^* : L \to L'$. Thus we can view α as a parametric lens from (L, L') to (1, 1)
(with trivial parameter space) and compose it in Para(Lens(C)) with a model
and a loss map (cf. (5)) to get


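[String diagram (6): the model (f, R[f]), the loss map (loss, R[loss]) and the learning rate α composed in sequence, with parameter wires P/P' and B/B' and input wires A/A'.]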


Example 10. In standard supervised learning in Smooth, one fixes some ε > 0
as a learning rate, and this is used to define α: α is simply constantly −ε, i.e.,
α(l) = −ε for any l ∈ L.


Example 11. In supervised learning in $\mathrm{POLY}_{\mathbb{Z}_2}$, the standard learning rate is
quite different: for a given L it is defined as the identity function, α(l) = l.


Other learning rate morphisms are possible as well: for example, one could
fix some ε > 0 and define a learning rate in Smooth by α(l) = −ε · l. Such a
choice would take into account how far away the network is from its desired goal
and adjust the learning rate accordingly.


3.4 Optimisers as Reparameterisations 


In this section we consider how to implement gradient descent (and its variants) 
into our framework. To this aim, note that the parametric lens (f, R[f]) rep- 
resenting our model (see (4)) outputs a P’, which represents a change in the 
parameter space. Now, we would like to receive not just the requested change 
in the parameter, but the new parameter itself. This is precisely what gradient 
descent accomplishes, when formalised as a lens. 
Definition 7. In any CRDC C we can define gradient update as a map G in
Lens(C) from (P, P) to (P, P') consisting of $(G, G^*) : (P, P) \to (P, P')$, where
$G(p) = p$ and $G^*(p, p') = p + p'$ (see footnote 7).


Intuitively, such a lens allows one to receive the requested change in parameter
and implement that change by adding that value to the current parameter. By its
type, we can now “plug” the gradient descent lens $G : (P, P) \to (P, P')$ above the
model (f, R[f]) in (4); formally, this is accomplished as a reparameterisation
of the parametric morphism (f, R[f]), cf. Section 2.1. This gives us Figure 3
(left).


Fig. 3: Model reparameterised by basic gradient descent (left) and a generic
stateful optimiser (right).


Example 12 (Gradient update in Smooth). In Smooth, the gradient descent repa- 
rameterisation will take the output from P’ and add it to the current value of 
P to get a new value of P. 


Example 13 (Gradient update in Boolean circuits). In the CRDC $\mathrm{POLY}_{\mathbb{Z}_2}$, the
gradient descent reparameterisation will again take the output from P' and
add it to the current value of P to get a new value of P; however, since + in
$\mathbb{Z}_2$ is the same as XOR, this can also be seen as taking the XOR of the
current parameter and the requested change; this is exactly how this algorithm
is implemented in [50].
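The two gradient update examples differ only in what "+" means. The following sketch writes the get and put parts of the gradient update lens as plain functions over tuples; the names are illustrative and not taken from any library.

```python
def gradient_update_get(p):
    """G(p) = p: the parameter is passed to the model unchanged."""
    return p

def gradient_update_put_smooth(p, dp):
    """G*(p, p') = p + p' over real vectors."""
    return tuple(pi + di for pi, di in zip(p, dp))

def gradient_update_put_z2(p, dp):
    """G*(p, p') = p + p' over Z_2, where addition of bits is XOR."""
    return tuple(pi ^ di for pi, di in zip(p, dp))

new_p = gradient_update_put_smooth((1.0, 2.0), (-0.5, 0.25))
new_bits = gradient_update_put_z2((1, 0, 1), (0, 1, 1))
```

In the Boolean case a requested change of 1 flips the corresponding parameter bit and a change of 0 leaves it alone.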


Other variants of gradient descent also fit naturally into this framework by 
allowing for additional input/output data with P. In particular, many of them 
keep track of the history of previous updates and use that to inform the next one. 
This is easy to model in our setup: instead of asking for a lens $(P, P) \to (P, P')$,
we ask instead for a lens $(S \times P, S \times P) \to (P, P')$ where S is some “state” object.


7 Note that as in the discussion in Section 2.4, we are implicitly assuming that P = P';
we have merely notated them differently to emphasize the different “roles” they play
(the first P can be thought of as “points”, the second as “vectors”).
Definition 8. A stateful parameter update consists of a choice of object S
(the state object) and a lens $U : (S \times P, S \times P) \to (P, P')$.


Again, we view this optimiser as a reparameterisation which may be “plugged
in” a model as in Figure 3 (right). Let us now consider how several well-known
optimisers can be implemented in this way.


Example 14 (Momentum). In the momentum variant of gradient descent, one
keeps track of the previous change and uses this to inform how the current
parameter should be changed. Thus, in this case, we set S = P, fix some γ >
0, and define the momentum lens $(U, U^*) : (P \times P, P \times P) \to (P, P')$ by
$U(s, p) = p$ and $U^*(s, p, p') = (s', p + s')$, where $s' = -\gamma s + p'$. Note momentum
recovers gradient descent when γ = 0.


In both standard gradient descent and momentum, our lens representation
has trivial get part. However, as soon as we move to more complicated variants,
this is no longer the case, as for instance in Nesterov momentum below.


Example 15 (Nesterov momentum). In Nesterov momentum, one uses the mo-
mentum from previous updates to tweak the input parameter supplied to the
network. We can precisely capture this by using a small variation of the lens in
the previous example. Again, we set S = P, fix some γ > 0, and define the Nes-
terov momentum lens $(U, U^*) : (P \times P, P \times P) \to (P, P')$ by $U(s, p) = p + \gamma s$
and $U^*$ as in the previous example.
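The two momentum lenses share a put part and differ only in their get part. The sketch below spells this out over tuples of floats, reading the update in Example 14 as s' = -gamma * s + p'; passing gamma as an explicit argument is a choice made here for illustration.

```python
def momentum_get(s, p, gamma):
    """Momentum: U(s, p) = p, the get part ignores the state."""
    return p

def nesterov_get(s, p, gamma):
    """Nesterov: U(s, p) = p + gamma * s, a lookahead parameter."""
    return tuple(pi + gamma * si for si, pi in zip(s, p))

def momentum_put(s, p, dp, gamma):
    """Shared put part: U*(s, p, p') = (s', p + s') with s' = -gamma * s + p'."""
    s_new = tuple(-gamma * si + di for si, di in zip(s, dp))
    p_new = tuple(pi + si for pi, si in zip(p, s_new))
    return s_new, p_new

s1, p1 = momentum_put((0.0, 0.0), (1.0, 2.0), (-0.5, 0.5), gamma=0.9)
lookahead = nesterov_get((1.0, -1.0), (0.0, 0.0), gamma=0.9)
# With gamma = 0 the state is just the last change, so this is plain gradient update.
_, p_plain = momentum_put((3.0,), (1.0,), (0.25,), gamma=0.0)
```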


Example 16 (Adagrad). Given any fixed ε > 0 and δ ∼ $10^{-7}$, Adagrad is
given by S = P, with the lens whose get part is $(g, p) \mapsto p$. The put is $(g, p, p') \mapsto
(g', p + \frac{\epsilon}{\delta + \sqrt{g'}} \odot p')$ where $g' = g + p' \odot p'$ and ⊙ is the elementwise (Hadamard)
product. Unlike with other optimization algorithms where the learning rate is
the same for all parameters, Adagrad divides the learning rate of each individual
parameter by the square root of the past accumulated squared gradients.
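As an illustration, the Adagrad put map can be sketched elementwise in a few lines. This assumes the usual Adagrad scaling factor eps / (delta + sqrt(g')); the default values of eps and delta are placeholders chosen for the example.

```python
import math

def adagrad_get(g, p):
    """Get part: (g, p) |-> p."""
    return p

def adagrad_put(g, p, dp, eps=0.01, delta=1e-7):
    """Put part: accumulate squared changes, then scale each change separately."""
    g_new = tuple(gi + di * di for gi, di in zip(g, dp))           # g' = g + p' * p'
    p_new = tuple(pi + eps / (delta + math.sqrt(gi)) * di          # per-parameter rate
                  for pi, gi, di in zip(p, g_new, dp))
    return g_new, p_new

g1, p1 = adagrad_put((0.0, 0.0), (1.0, 1.0), (-2.0, 0.0))
```

Parameters that have received large changes in the past get a smaller effective step, which is the per-parameter adaptation described above.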


Example 17 (Adam). Adaptive Moment Estimation (Adam) is another method
that computes adaptive learning rates for each parameter by storing exponen-
tially decaying averages of past gradients (m) and past squared gradients (v). For
fixed $\beta_1, \beta_2 \in [0, 1)$, ε > 0, and δ ∼ $10^{-8}$, Adam is given by S = P × P, with
the lens whose get part is $(m, v, p) \mapsto p$ and whose put part is

$$\mathrm{put}(m, v, p, p') = \left(\hat{m}',\ \hat{v}',\ p + \frac{\epsilon}{\delta + \sqrt{\hat{v}'}} \odot \hat{m}'\right),$$

where $m' = \beta_1 m + (1 - \beta_1) p'$, $v' = \beta_2 v + (1 - \beta_2) p'^2$, and

$$\hat{m}' = \frac{m'}{1 - \beta_1^t}, \qquad \hat{v}' = \frac{v'}{1 - \beta_2^t}.$$

Note that, so far, optimisers/reparameterisations have been added to the
P/P' wires, in order to change the model's parameters (Fig. 3). In Section 4.2
we will study them on the A/A' wires instead, giving deep dreaming.


4 Learning with Parametric Lenses 


In the previous section we have seen how all the components of learning can be 
modeled as parametric lenses. We now study how all these components can be 
put together to form supervised learning systems. In addition to studying the
most common examples of supervised learning, namely systems that learn parameters,
we also study a different kind of system: those that learn their inputs. This is a
technique commonly known as deep dreaming, and we present it as a natural
counterpart of supervised learning of parameters.


Before we describe these systems, it will be convenient to represent all the
inputs and outputs of our parametric lenses as parameters. In (6), we see the
P/P' and B/B' inputs and outputs as parameters; however, the A/A' wires are
not. To view the A/A' inputs as parameters, we compose that system with the
parametric lens η we now define. The parametric lens η has the type $(1, 1) \to
(A, A')$ with parameter space (A, A') defined by $(\mathrm{get}_\eta = 1_A, \mathrm{put}_\eta = \pi_1)$ and can
be depicted graphically as a string diagram in which the parameter wires A/A' are
bent into the horizontal wires A/A'. Composing η with the rest of the learning
system in (6) gives us the closed parametric lens


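[String diagram (7): η, the model, the loss map and the learning rate α composed into a closed parametric lens whose only open wires are the parameter ports A/A', P/P' and B/B'.]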


This composite is now a map in Para(Lens(C)) from (1, 1) to (1, 1); all its inputs
and outputs are now vertical wires, i.e., parameters. Unpacking it further, this is
a lens of type $(A \times P \times B, A' \times P' \times B') \to (1, 1)$ whose get map is the terminal
map, and whose put map is of the type $A \times P \times B \to A' \times P' \times B'$. It can be
unpacked as the composite $\mathrm{put}(a, p, b_t) = (a', p', b_t')$, where

$$b_p = f(p, a), \qquad (b_t', b_p') = R[\mathrm{loss}](b_t, b_p, \alpha(\mathrm{loss}(b_t, b_p))), \qquad (p', a') = R[f](p, a, b_p').$$


In the next two sections we consider further additions to the image above which 
correspond to different types of supervised learning. 


4.1 Supervised Learning of Parameters 


The most common type of learning performed on (7) is supervised learning of
parameters. This is done by reparameterising (cf. Section 2.1) the image in the
following manner. The parameter ports are reparameterised by one of the (pos-
sibly stateful) optimisers described in the previous section, while the backward
wires A' of inputs and B' of outputs are discarded. This finally yields the com-
plete picture of a system which learns the parameters in a supervised manner:
[String diagram: the closed parametric lens (7) with the parameter ports P/P' reparameterised by an optimiser with ports S×P/S×P, and the backward wires A' and B' discarded.]


Fixing a particular optimiser $(U, U^*) : (S \times P, S \times P) \to (P, P')$ we again
unpack the entire construction. This is a map in Para(Lens(C)) from (1, 1) to
(1, 1) whose parameter space is $(A \times S \times P \times B, S \times P)$. In other words, this
is a lens of type $(A \times S \times P \times B, S \times P) \to (1, 1)$ whose get component is the
terminal map. Its put map has the type $A \times S \times P \times B \to S \times P$ and unpacks
to $\mathrm{put}(a, s, p, b_t) = U^*(s, p, p')$, where

$$\bar{p} = U(s, p), \qquad b_p = f(\bar{p}, a), \qquad (b_t', b_p') = R[\mathrm{loss}](b_t, b_p, \alpha(\mathrm{loss}(b_t, b_p))), \qquad (p', a') = R[f](\bar{p}, a, b_p').$$


While this formulation might seem daunting, we note that it just explicitly
specifies the computation performed by a supervised learning system. The vari-
able $\bar{p}$ represents the parameter supplied to the network by the stateful gradient
update rule (in many cases this is equal to p); $b_p$ represents the prediction of
the network (contrast this with $b_t$ which represents the ground truth from the
dataset). Variables with a tick ' represent changes: $b_p'$ and $b_t'$ are the changes
on predictions and true values respectively, while p' and a' are changes on the
parameters and inputs. Furthermore, this arises automatically out of the rule for
lens composition (3); what we needed to specify is just the lenses themselves.
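The unpacked put map translates almost line by line into code. The sketch below is one possible reading, with every component passed in as a plain function; the scalar model, loss, learning rate and optimiser used to exercise it are hypothetical choices made for illustration.

```python
def supervised_put(model, rev_model, loss, rev_loss, alpha, opt_get, opt_put,
                   a, s, p, b_t):
    """One supervised step: put(a, s, p, b_t) = U*(s, p, p')."""
    p_bar = opt_get(s, p)                                     # p_bar = U(s, p)
    b_p = model(p_bar, a)                                     # b_p = f(p_bar, a)
    _db_t, db_p = rev_loss(b_t, b_p, alpha(loss(b_t, b_p)))   # (b_t', b_p')
    dp, _da = rev_model(p_bar, a, db_p)                       # (p', a')
    return opt_put(s, p, dp)                                  # a' and b_t' are discarded

# Hypothetical instance: f(p, a) = p * a, quadratic error, constant rate -0.1,
# basic gradient update with a trivial state.
model = lambda p, a: p * a
rev_model = lambda p, a, v: (a * v, p * v)
loss = lambda b_t, b_p: 0.5 * (b_p - b_t) ** 2
rev_loss = lambda b_t, b_p, c: (c * (b_t - b_p), c * (b_p - b_t))
alpha = lambda l: -0.1
opt_get = lambda s, p: p
opt_put = lambda s, p, dp: (s, p + dp)

state, new_p = supervised_put(model, rev_model, loss, rev_loss, alpha,
                              opt_get, opt_put, a=2.0, s=None, p=1.0, b_t=4.0)
```

Starting from p = 1 with target 4 = 2 · 2, the step moves the parameter towards 2.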

We justify and illustrate our approach on a series of case studies drawn from
the literature. This presentation has the advantage of treating all these instances
uniformly in terms of basic constructs, highlighting their similarities and differ-
ences. First, we fix some parametric map $(\mathbb{R}^p, f) : \mathrm{Para}(\mathrm{Smooth})(\mathbb{R}^a, \mathbb{R}^b)$ in
Smooth and the constant negative learning rate $\alpha : \mathbb{R}$ (Example 10). We then
vary the loss function and the gradient update, seeing how the put map above
reduces to many of the known cases in the literature.


Example 18 (Quadratic error, basic gradient descent). Fix the quadratic error
(Example 6) as the loss map and basic gradient update (Example 12). Then the
aforementioned put map simplifies. Since there is no state, its type reduces to
$A \times P \times B \to P$, and we have $\mathrm{put}(a, p, b_t) = p + p'$, where
$(p', a') = R[f](p, a, \alpha \cdot (f(p, a) - b_t))$. Note that α here is simply a constant, and due to the linearity
of the reverse derivative (Def. 4), we can slide the α from the costate into the
basic gradient update lens. Rewriting this update, and performing this sliding we
obtain a closed form update step $\mathrm{put}(a, p, b_t) = p + \alpha \cdot (R[f](p, a, f(p, a) - b_t); \pi_0)$,
where the negative descent component of gradient descent is here contained in
the choice of the negative constant α.


This example gives us a variety of regression algorithms solved iteratively
by gradient descent: it embeds some parametric map $(\mathbb{R}^p, f) : \mathbb{R}^a \to \mathbb{R}^b$ into the
system which performs regression on input data, where a denotes the input to
the model and $b_t$ denotes the ground truth. If the corresponding f is linear and
b = 1, we recover simple linear regression with gradient descent. If the codomain
is multi-dimensional, i.e. we are predicting multiple scalars, then we recover
multivariate linear regression. Likewise, we can model a multi-layer perceptron or
even more complex neural network architectures performing supervised learning
of parameters simply by changing the underlying parametric map.
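To make the simple linear regression case concrete, here is a sketch of the closed form update step for the affine model f(p, a) = w·a + c with p = (w, c). The data, learning rate and number of passes are illustrative assumptions.

```python
def put(a, p, b_t, alpha=-0.1):
    """One step p + alpha * (R[f](p, a, f(p, a) - b_t); pi_0) for f(p, a) = w * a + c."""
    w, c = p
    err = (w * a + c) - b_t          # f(p, a) - b_t
    dw, dc = a * err, err            # parameter component of R[f](p, a, err)
    return (w + alpha * dw, c + alpha * dc)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # samples of b = 2a + 1
p = (0.0, 0.0)
for _ in range(500):                 # repeated passes over the data
    for a, b_t in data:
        p = put(a, p, b_t)
```

After these passes p is close to (2, 1), the slope and intercept that generated the data. The descent direction comes entirely from the negative sign of alpha.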


Example 19 (Softmax cross entropy, basic gradient descent). Fix Softmax cross
entropy (Example 8) as the loss map and basic gradient update (Example 12).
Again the put map simplifies. The type reduces to $A \times P \times B \to P$ and we have
$\mathrm{put}(a, p, b_t) = p + p'$ where $(p', a') = R[f](p, a, \alpha \cdot (\mathrm{Softmax}(f(p, a)) - b_t))$. The
same rewriting performed on the previous example can be done here.


This example recovers logistic regression, e.g. classification. 
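A sketch of this update for a linear model f(W, a) = W·a illustrates the classification case. The parameter component of R[f](W, a, v) is then the outer product of v and a; the two-class data point, the learning rate and the step count are assumptions made for the example.

```python
import math

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def f(W, a):
    """Linear model: one logit per class."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def put(a, W, b_t, alpha=-0.5):
    """W + W' where W' is the parameter part of R[f](W, a, alpha * (Softmax(f(W, a)) - b_t))."""
    v = [alpha * (s - t) for s, t in zip(softmax(f(W, a)), b_t)]
    return [[w + vi * x for w, x in zip(row, a)] for row, vi in zip(W, v)]

W = [[0.0, 0.0], [0.0, 0.0]]
a, b_t = [1.0, 2.0], [0.0, 1.0]                 # input and one-hot label for class 1
before = softmax(f(W, a))[1]
for _ in range(10):
    W = put(a, W, b_t)
after = softmax(f(W, a))[1]
```

The predicted probability of the labelled class starts at 0.5 and increases with each step.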


Example 20 (Mean squared error, Nesterov Momentum). Fix the quadratic error
(Example 6) as the loss map and Nesterov momentum (Example 15) as the
gradient update. This time the put map $A \times S \times P \times B \to S \times P$ does not have a
simplified type. The implementation of put reduces to $\mathrm{put}(a, s, p, b_t) = (s', p + s')$,
where $\bar{p} = p + \gamma s$, $(p', a') = R[f](\bar{p}, a, \alpha \cdot (f(\bar{p}, a) - b_t))$, and $s' = -\gamma s + p'$.


This example with Nesterov momentum differs in two key points from all 
the other ones: i) the optimiser is stateful, and ii) its get map is not trivial. 
While many other optimisers are stateful, the non-triviality of the get map here 
showcases the importance of lenses. They allow us to make precise the notion of 
computing a “lookahead” value for Nesterov momentum, something that is in 
practice usually handled in ad-hoc ways. Here, the algebra of lens composition 
handles this case naturally by using the get map, a seemingly trivial, unused 
piece of data for previous optimisers. 

Our last example, using a different base category $\mathrm{POLY}_{\mathbb{Z}_2}$, shows that our
framework captures learning in not just continuous, but discrete settings too.
Again, we fix a parametric map $(\mathbb{Z}^p, f) : \mathrm{POLY}_{\mathbb{Z}_2}(\mathbb{Z}^a, \mathbb{Z}^b)$ but this time we fix
the identity learning rate (Example 11), instead of a constant one.


Example 21 (Basic learning in Boolean circuits). Fix XOR as the loss map (Ex-
ample 7) and the basic gradient update (Example 13). The put map again
simplifies. The type reduces to $A \times P \times B \to P$ and the implementation to
$\mathrm{put}(a, p, b_t) = p + p'$ where $(p', a') = R[f](p, a, f(p, a) + b_t)$.
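A tiny instance shows the Boolean update at work. The circuit below, f(p, a) = p0·a0 + p1·a1 over Z_2, is a hypothetical choice; its reverse derivative is obtained from the formal partial derivatives of the polynomial, and all arithmetic is done with AND and XOR on bits.

```python
def f(p, a):
    """Circuit f(p, a) = p0*a0 + p1*a1 over Z_2 (AND for product, XOR for sum)."""
    return (p[0] & a[0]) ^ (p[1] & a[1])

def rev_f(p, a, v):
    """R[f](p, a, v) = (p', a'): partial derivatives over Z_2, multiplied by v."""
    return ((a[0] & v, a[1] & v), (p[0] & v, p[1] & v))

def put(a, p, b_t):
    """put(a, p, b_t) = p + p' with (p', a') = R[f](p, a, f(p, a) + b_t)."""
    dp, _da = rev_f(p, a, f(p, a) ^ b_t)
    return (p[0] ^ dp[0], p[1] ^ dp[1])

# Learn the function b = a0 from three input-output pairs.
p = (0, 0)
for a, b_t in [((1, 0), 1), ((0, 1), 0), ((1, 1), 1)]:
    p = put(a, p, b_t)
```

Only the first datapoint produces an error bit, which flips p0; the remaining pairs are already predicted correctly and leave the parameter unchanged.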


A sketch of learning iteration. Having described a number of examples in
supervised learning, we outline how to model learning iteration in our framework.
Recall the aforementioned put map whose type is $A \times P \times B \to P$ (for simplicity
here modelled without state S). This map takes an input-output pair $(a_0, b_0)$,
the current parameter $p_i$ and produces an updated parameter $p_{i+1}$. At the next
time step, it takes a potentially different input-output pair $(a_1, b_1)$, the updated
parameter $p_{i+1}$ and produces $p_{i+2}$. This process is then repeated. We can model
this iteration as a composition of the put map with itself, as a composite
$(A \times \mathrm{put} \times B) ; \mathrm{put}$ whose type is $A \times A \times P \times B \times B \to P$. This map takes two input-
output pairs A × B, a parameter and produces a new parameter by processing
these datapoints in sequence. One can see how this process can be iterated any
number of times, and even represented as a string diagram.

But we note that with a slight reformulation of the put map, it is possible 
to obtain a conceptually much simpler definition. The key insight lies in seeing 
that the map put : Ax P x B > P is essentially an endo-map P — P with some 
extra inputs A x B; it’s a parametric map! 

In other words, we can recast the put map as a parametric map (A x B, put) : 
Para(C)(P, P). Being an endo-map, it can be composed with itself. The resulting 
composite is an endo-map taking two “parameters”: input-output pair at the 
time step 0 and time step 1. This process can then be repeated, with Para 
composition automatically taking care of the algebra of iteration. 


[String diagram: the parametric endo-map put : P → P, with parameter A × B, composed with itself three times.]


This reformulation captures the essence of parameter iteration: one can think
of it as a trajectory $p_i, p_{i+1}, p_{i+2}, \ldots$ through the parameter space; but it is a
trajectory parameterised by the dataset. With different datasets the algorithm
will take a different path through this space and learn different things.
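Viewing put as a parametric endo-map suggests implementing iteration as a fold over the dataset. The sketch below does this with `functools.reduce`; the scalar model f(p, a) = p·a with quadratic error, and the helper name `para_compose_put`, are assumptions made for the example.

```python
from functools import reduce

def put(a, p, b_t, alpha=-0.1):
    """Update step for the scalar model f(p, a) = p * a with quadratic error."""
    return p + alpha * a * (p * a - b_t)

def para_compose_put(dataset):
    """Compose the endo-map (A x B, put) : P -> P with itself, once per datapoint.

    The result is a map P -> P whose 'parameter' is the whole dataset.
    """
    def composite(p):
        return reduce(lambda q, ab: put(ab[0], q, ab[1]), dataset, p)
    return composite

dataset = [(1.0, 3.0), (2.0, 6.0)] * 50          # samples of b = 3a
trajectory_end = para_compose_put(dataset)(0.0)
one_step = para_compose_put([(1.0, 3.0)])(0.0)
```

Feeding a different dataset to `para_compose_put` yields a different endo-map, and hence a different trajectory through the parameter space.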


4.2 Deep Dreaming: Supervised Learning of Inputs 


We have seen that reparameterising the parameter port with gradient descent
allows us to capture supervised parameter learning. In this section we describe
how reparameterising the input port provides us with a way to enhance an input
image to elicit a particular interpretation. This is the idea behind the technique
called Deep Dreaming, appearing in the literature in many forms [19, 34, 35, 44].


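[String diagram (8): the closed parametric lens (7) with the input ports A/A' reparameterised by an optimiser with ports S×A/S×A, alongside the parameter ports P and B.]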
Deep dreaming is a technique which uses the parameters p of some trained
classifier network to iteratively dream up, or amplify some features of a class b on
a chosen input a. For example, if we start with an image of a landscape $a_0$, a label
b of a “cat” and a parameter p of a sufficiently well-trained classifier, we can start
performing “learning” as usual: computing the predicted class for the landscape
$a_0$ for the network with parameters p, and then computing the distance between
the prediction and our label of a cat b. When performing backpropagation, the
respective changes computed for each layer tell us how the activations of that
layer should have been changed to be more “cat” like. This includes the first
(input) layer of the landscape $a_0$. Usually, we discard these changes and apply
gradient update to the parameters. In deep dreaming we discard the parameter
changes and apply gradient update to the input (see (8)). Gradient update here takes these
changes and computes a new image $a_1$ which is the same image of the landscape,
but changed slightly so as to look more like whatever the network thinks a cat looks
like. This is the essence of deep dreaming, where iteration of this process allows
networks to dream up features and shapes on a particular chosen image.

Just like in the previous subsection, we can write this deep dreaming system
as a map in Para(Lens(C)) from (1, 1) to (1, 1) whose parameter space is
$(S \times A \times P \times B, S \times A)$. In other words, this is a lens of type $(S \times A \times P \times B, S \times A) \to (1, 1)$
whose get map is trivial. Its put map has the type $S \times A \times P \times B \to S \times A$
and unpacks to $\mathrm{put}(s, a, p, b_t) = U^*(s, a, a')$, where $\bar{a} = U(s, a)$, $b_p = f(p, \bar{a})$,
$(b_t', b_p') = R[\mathrm{loss}](b_t, b_p, \alpha(\mathrm{loss}(b_t, b_p)))$, and $(p', a') = R[f](p, \bar{a}, b_p')$.

We note that deep dreaming is usually presented without any loss function as
a maximisation of a particular activation in the last layer of the network output
Section 2.]. This maximisation is done with gradient ascent, as opposed to
gradient descent. However, this is just a special case of our framework where
the loss function is the dot product (Example 9). The choice of the particular
activation is encoded as a one-hot vector, and the loss function in that case
essentially masks the network output, leaving active only the particular chosen
activation. The final component is the gradient ascent: this is simply recovered
by choosing a positive, instead of a negative learning rate [44]. We explicitly
unpack this in the following example.


Example 22 (Deep dreaming, dot product loss, basic gradient update). Fix Smooth
as base category, a parametric map $(\mathbb{R}^p, f) : \mathrm{Para}(\mathrm{Smooth})(\mathbb{R}^a, \mathbb{R}^b)$, the dot
product loss (Example 9), basic gradient update (Example 12), and a positive
learning rate $\alpha : \mathbb{R}$. Then the above put map simplifies. Since there is no state, its
type reduces to $A \times P \times B \to A$ and its implementation to $\mathrm{put}(a, p, b_t) = a + a'$,
where $(p', a') = R[f](p, a, \alpha \cdot b_t)$. Like in Example 18, this update can be rewrit-
ten as $\mathrm{put}(a, p, b_t) = a + \alpha \cdot (R[f](p, a, b_t); \pi_1)$, making a few things apparent.
This update does not depend on the prediction f(p, a): no matter what the net-
work has predicted, the goal is always to maximize particular activations. Which
activations? The ones chosen by $b_t$. When $b_t$ is a one-hot vector, this picks out
the activation of just one class to maximize, which is often done in practice.
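The deep dreaming update can be sketched for a small model. Below, f(p, a) is a single tanh layer with a fixed, hypothetical weight matrix W standing in for a trained parameter; only the input component of R[f] is needed, and the learning rate is positive (gradient ascent).

```python
import math

W = [[1.0, 0.0], [0.5, -1.0]]        # hypothetical trained parameter p

def f(p, a):
    """Model: one tanh layer, f(p, a)_i = tanh(sum_j p[i][j] * a[j])."""
    return [math.tanh(sum(w * x for w, x in zip(row, a))) for row in p]

def rev_f_input(p, a, v):
    """Input component a' of R[f](p, a, v) for the tanh layer above."""
    y = f(p, a)
    scaled = [(1 - yi * yi) * vi for yi, vi in zip(y, v)]     # derivative of tanh
    return [sum(p[i][j] * scaled[i] for i in range(len(p))) for j in range(len(a))]

def dream_put(a, p, b_t, alpha=0.1):
    """put(a, p, b_t) = a + a' with (p', a') = R[f](p, a, alpha * b_t)."""
    da = rev_f_input(p, a, [alpha * t for t in b_t])
    return [x + d for x, d in zip(a, da)]

a = [0.0, 0.0]
start_activation = f(W, a)[1]
for _ in range(20):
    a = dream_put(a, W, [0.0, 1.0])  # one-hot b_t selects the activation of class 1
end_activation = f(W, a)[1]
```

The parameter W is never modified; only the input a moves, in the direction that increases the selected activation.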


While we present only the most basic image, there is plenty of room left
for exploration. The work of Section 2.] adds an extra regularization term
to the image. In general, the neural network f is sometimes changed to copy
a number of internal activations which are then exposed on the output layer.
Maximizing all these activations often produces more visually appealing results.
In the literature we did not find an example which uses the Softmax cross entropy
(Example 8) as a loss function in deep dreaming, which seems like the more
natural choice in this setting. Furthermore, while deep dreaming commonly uses
basic gradient descent, there is nothing preventing the use of any of the optimiser
lenses discussed in the previous section, or even doing deep dreaming in the
context of Boolean circuits. Lastly, learning iteration, which was described at
the end of the previous subsection, can be modelled here in an analogous way.


5 Implementation 


We provide a proof-of-concept implementation as a Python library; full usage
examples, source code, and experiments can be found at [17]. We demonstrate
the correctness of our library empirically using a number of experiments im- 
plemented both in our library and in Keras 01, a popular framework for deep 
learning. For example, one experiment is a model for the MNIST image clas- 
sification problem [83]: we implement the same model in both frameworks and 
achieve comparable accuracy. Note that despite similarities between the user in- 
terfaces of our library and of Keras, a model in our framework is constructed 
as a composition of parametric lenses. This is fundamentally different to the 
approach taken by Keras and other existing libraries, and highlights how our
proposed algebraic structures naturally guide programming practice.

In summary, our implementation demonstrates the advantages of our ap- 
proach. Firstly, computing the gradients of the network is greatly simplified 
through the use of lens composition. Secondly, model architectures can be ex- 
pressed in a principled, mathematical language; as morphisms of a monoidal 
category. Finally, the modularity of our approach makes it easy to see how var- 
ious aspects of training can be modified: for example, one can define a new 
optimization algorithm simply by defining an appropriate lens. We now give a 
brief sketch of our implementation. 


5.1 Constructing a Model with Lens and Para 


We model a lens $(f, f^*)$ in our library with the Lens class, which consists of a
pair of maps fwd and rev corresponding to $f$ and $f^*$, respectively. For example,
we write the identity lens $(1_A, \pi_2)$ as follows:


identity = Lens(lambda x: x, lambda x_dy: x_dy[1])


The composition (in diagrammatic order) of Lens values f and g is written 
f >> g, and monoidal composition as f @ g. Similarly, the type of Para maps 
is modeled by the Para class, with composition and monoidal product written 
the same way. Our library provides several primitive Lens and Para values. 
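
To make the interface concrete, the following is a minimal sketch of how a Lens class supporting `>>` and `@` could be written. It is illustrative only and is not taken from the library's source: the convention that rev receives a pair (x, dy) follows the identity example above, while the primitives `square` and `triple` are hypothetical.

```python
class Lens:
    """A lens (f, f*): fwd maps x to f(x); rev maps the pair (x, dy) to f*(x, dy)."""

    def __init__(self, fwd, rev):
        self.fwd = fwd
        self.rev = rev

    def __rshift__(self, other):
        # Sequential composition in diagrammatic order: self first, then other.
        def fwd(x):
            return other.fwd(self.fwd(x))

        def rev(x_dz):
            x, dz = x_dz
            # Pull the change back through `other`, then through `self`.
            dy = other.rev((self.fwd(x), dz))
            return self.rev((x, dy))

        return Lens(fwd, rev)

    def __matmul__(self, other):
        # Monoidal (parallel) composition acting componentwise on pairs.
        def fwd(xs):
            x1, x2 = xs
            return (self.fwd(x1), other.fwd(x2))

        def rev(xs_dys):
            (x1, x2), (dy1, dy2) = xs_dys
            return (self.rev((x1, dy1)), other.rev((x2, dy2)))

        return Lens(fwd, rev)


identity = Lens(lambda x: x, lambda x_dy: x_dy[1])

# Hypothetical primitives, each paired with its reverse derivative.
square = Lens(lambda x: x * x, lambda x_dy: 2 * x_dy[0] * x_dy[1])
triple = Lens(lambda x: 3 * x, lambda x_dy: 3 * x_dy[1])

composite = square >> triple  # x -> 3 * x^2, with reverse derivative 6 * x * dy
```

The rev map of the composite is assembled purely from the rev maps of its parts, which is the chain rule expressed as lens composition.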

Let us now see how to construct a single-layer neural network from the
composition of such primitives. Diagrammatically, we wish to construct the following
model, representing a single ‘dense’ layer of a neural network:


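[Diagram (9): linear, bias, and activation composed in sequence, with input $\mathbb{R}^a$, output $\mathbb{R}^b$, and parameter spaces $\mathbb{R}^{b \times a}$ and $\mathbb{R}^b$ for linear and bias, respectively]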

Here, the parameters of linear are the coefficients of a $b \times a$ matrix, and the
underlying lens has as its forward map the function $(M, x) \mapsto M \cdot x$, where $M$ is
the $b \times a$ matrix whose coefficients are the $\mathbb{R}^{b \times a}$ parameters, and $x \in \mathbb{R}^a$ is the
input vector. The bias map is even simpler: the forward map of the underlying
lens is simply pointwise addition of inputs and parameters: $(b, x) \mapsto b + x$. Finally,
the activation map simply applies a nonlinear function (e.g., sigmoid) to the 
input, and thus has the trivial (unit) parameter space. The representation of 
this composition in code is straightforward: we can simply compose the three 
primitive Para maps as in (9): 


def dense(a, b, activation):
    return linear(a, b) >> bias(b) >> activation


Note that by constructing model architectures in this way, the computation 
of reverse derivatives is greatly simplified: we obtain the reverse derivative ‘for 
free’ as the put map of the model. Furthermore, adding new primitives is also 
simplified: the user need simply provide a function and its reverse derivative in 
the form of a Para map. Finally, notice also that our approach is truly
compositional: we can define a neural network with a single hidden layer of n hidden units simply
by composing two dense layers, as follows:


dense(a, n, activation) >> dense(n, b, activation) 
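
As a hedged illustration of how such primitives might fit together, the sketch below gives a pure-Python stand-in for Para maps. It does not reproduce the library's classes: the `Para` class, its `fwd(p, x)` and `rev(p, x, dy)` signatures, the list-based vectors, and the pairing of parameters under composition are all assumptions made for this example.

```python
import math


class Para:
    """Parametric lens sketch: fwd(p, x) -> y and rev(p, x, dy) -> (dp, dx)."""

    def __init__(self, fwd, rev):
        self.fwd = fwd
        self.rev = rev

    def __rshift__(self, other):
        # The composite's parameters are pairs (p, q): p for self, q for other.
        def fwd(pq, x):
            p, q = pq
            return other.fwd(q, self.fwd(p, x))

        def rev(pq, x, dz):
            p, q = pq
            y = self.fwd(p, x)
            dq, dy = other.rev(q, y, dz)
            dp, dx = self.rev(p, x, dy)
            return (dp, dq), dx

        return Para(fwd, rev)


def linear(a, b):
    # Parameters: a b x a matrix M, stored as a list of rows.
    def fwd(M, x):
        return [sum(M[i][j] * x[j] for j in range(a)) for i in range(b)]

    def rev(M, x, dy):
        dM = [[dy[i] * x[j] for j in range(a)] for i in range(b)]
        dx = [sum(M[i][j] * dy[i] for i in range(b)) for j in range(a)]
        return dM, dx

    return Para(fwd, rev)


def bias(b):
    # Parameters: a vector c of length b, added pointwise to the input.
    def fwd(c, x):
        return [ci + xi for ci, xi in zip(c, x)]

    def rev(c, x, dy):
        return list(dy), list(dy)

    return Para(fwd, rev)


def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# The activation has the trivial parameter space, represented here by ().
sigmoid = Para(
    lambda _unit, x: [_sigmoid(t) for t in x],
    lambda _unit, x, dy: ((), [_sigmoid(t) * (1.0 - _sigmoid(t)) * d for t, d in zip(x, dy)]),
)


def dense(a, b, activation):
    return linear(a, b) >> bias(b) >> activation


layer = dense(2, 1, sigmoid)
params = (([[0.0, 0.0]], [0.0]), ())  # ((M, c), unit), mirroring the composition order
output = layer.fwd(params, [1.0, 2.0])
grads, dx = layer.rev(params, [1.0, 2.0], [1.0])
```

In this sketch the gradient with respect to every parameter falls out of `rev` on the composite, with no layer-specific backpropagation code.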


5.2 Learning 


Now that we have constructed a model, we also need to use it to learn from 
data. Concretely, we will construct a full parametric lens as in Figure P] then 
extract its put map to iterate over the dataset. 

By way of example, let us see how to construct the following parametric lens, 
representing basic gradient descent over a single layer neural network with a 
fixed learning rate: 


[Diagram (10): the parametric lens for basic gradient descent over a single-layer neural network with a fixed learning rate]


This morphism is constructed essentially as below, where apply_update(a, f)
represents the ‘vertical stacking’ of a atop f:


apply_update(basic_update, dense) >> loss >> learning_rate(e) 


Now, given the parametric lens of (10), one can construct a morphism
$\mathrm{step} : B \times P \times A \to P$, which is simply the put map of the lens. Training the model then
consists of iterating the step function over dataset examples $(x, y) \in A \times B$ to
optimise some initial choice of parameters $\theta_0 \in P$, by letting $\theta_{i+1} = \mathrm{step}(y_i, \theta_i, x_i)$.

Note that our library also provides a utility function to construct step from 
its various pieces: 


step = supervised_step(model, update, loss, learning_rate) 
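
The iteration $\theta_{i+1} = \mathrm{step}(y_i, \theta_i, x_i)$ can be sketched independently of the library as follows. Everything here is an assumption made for illustration: the helper names `make_step` and `train`, the squared-error loss, the one-dimensional linear model, and the toy dataset are hypothetical and do not reproduce `supervised_step`.

```python
def make_step(fwd, rev, learning_rate):
    """Build step : B x P x A -> P for a model given by fwd(theta, x) and rev(theta, x, dy).

    Assumes the loss 0.5 * (y_hat - y)^2 and a basic gradient descent update.
    """
    def step(y, theta, x):
        y_hat = fwd(theta, x)
        dy = y_hat - y                      # derivative of the loss in y_hat
        dtheta, _dx = rev(theta, x, dy)     # reverse derivative of the model
        return tuple(t - learning_rate * g for t, g in zip(theta, dtheta))

    return step


# Hypothetical model y = w * x + c with parameters theta = (w, c).
def fwd(theta, x):
    return theta[0] * x + theta[1]


def rev(theta, x, dy):
    return (dy * x, dy), dy * theta[0]


def train(step, theta0, dataset, epochs):
    theta = theta0
    for _ in range(epochs):
        for x, y in dataset:
            theta = step(y, theta, x)
    return theta


dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # samples of y = 2x + 1
theta = train(make_step(fwd, rev, 0.1), (0.0, 0.0), dataset, epochs=500)
```

On this toy dataset the parameters approach (2, 1), the coefficients used to generate the data.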


For an end-to-end example of model training and iteration, we refer the
interested reader to the experiments accompanying the code [17].


6 Related Work 


The work [23] is closely related to ours, in that it provides an abstract categorical
model of backpropagation. However, it differs in a number of key aspects. We
give a complete lens-theoretic explanation of what is back-propagated via (i)
the use of CRDCs to model gradients; and (ii) the Para construction to model
parametric functions and parameter update. We thus can go well beyond [23]
in terms of examples: their example of smooth functions and basic gradient
descent is covered in our subsection 

We also explain some of the constructions of [23] in a more structured way.
For example, rather than considering the category Learn of [23] as primitive,
here we construct it as a composite of two more basic constructions (the Para
and Lens constructions). This flexibility could be used, for example, to
compositionally replace Para with a variant allowing parameters to come from a
different category, or lenses with the category of optics, enabling us to model
things such as control flow using prisms.

One more relevant aspect is functoriality. We use a functor to augment a 
parametric map with its backward pass, just like [23]. However, they additionally 
augmented this map with a loss map and gradient descent using a functor as 
well. This added extra conditions on the partial derivatives of the loss function: 
it needed to be invertible in the 2nd variable. This constraint was not justified 
in [23], nor is it a constraint that appears in machine learning practice. This led 
us to reexamine their constructions, coming up with our reformulation that does 
not require it. While loss maps and optimisers are mentioned in [23] as parts of
the aforementioned functor, here they are extracted out and play a key role: loss 
maps are parametric lenses and optimisers are reparameterisations. Thus, in this 
paper we instead use Para-composition to add the loss map to the model, and 
Para 2-cells to add optimisers. The mentioned inverse of the partial derivative
of the loss map in the 2nd variable was also hypothesised to be relevant to deep
dreaming. We have investigated this possibility thoroughly in our paper, showing
that it is gradient update which is used to dream up pictures. We also correct a small
issue in Theorem III.2 of [23]. There, the morphisms of Learn were defined up to
an equivalence (pg. 4 of [23]) but, unfortunately, the functor defined in Theorem
III.2 does not respect this equivalence relation. Our approach instead uses 2-cells
which come from the universal property of Para: a 2-cell from $(P, f) : A \to B$
to $(Q, g) : A \to B$ is a lens, and hence has two components: a map $a : Q \to P$
and $a^* : Q \times P \to Q$. By comparison, we can see the equivalence relation of
[23] as being induced by a map $a : Q \to P$, and not a lens. Our approach highlights
the importance of the 2-categorical structure of learners. In addition, it does not
treat the functor $\mathrm{Para}(\mathcal{C}) \to \mathrm{Learn}$ as a primitive. In our case, this functor
has the type $\mathrm{Para}(\mathcal{C}) \to \mathrm{Para}(\mathrm{Lens}(\mathcal{C}))$ and arises from applying Para to a
canonical functor $\mathcal{C} \to \mathrm{Lens}(\mathcal{C})$ existing for any reverse derivative category, not
just Smooth. Lastly, in our paper we took advantage of the graphical calculus
for Para, redrawing many diagrams appearing in [23] in a structured way.

Other than [23], there are a few more relevant papers. The work of con- 
tains a sketch of some of the ideas this paper evolved from. They are based 
on the interplay of optics with parameterisation, albeit framed in the setting of 
diffeological spaces, and requiring cartesian and local cartesian closed structure 
on the base category. Lenses and Learners are studied in the eponymous work 
of which observes that learners are parametric lenses. They do not explore 
any of the relevant Para or CRDC structure, but make the distinction between 
symmetric and asymmetric lenses, studying how they are related to learners
defined in [23]. A lens-like implementation of automatic differentiation is the focus
of , but learning algorithms are not studied. A relationship between the
category-theoretic perspective on probabilistic modeling and gradient-based optimisation
is studied in which also studies a variant of the Para construction. Usage of
Cartesian differential categories to study learning is found in [46]. They extend
the differential operator to work on stateful maps, but do not study lenses,
parameterisation, or update maps. The work of studies deep learning in the
context of Cycle-consistent Generative Adversarial Networks and formalises 
it via free and quotient categories, making parallels to the categorical formula- 
tions of database theory [45]. They do use the Para construction, but do not 
relate it to lenses or reverse derivative categories. A general survey of
category-theoretic approaches to machine learning, covering many of the above papers,
can be found in [43]. Lastly, the concept of parametric lenses has started
appearing in recent formulations of categorical game theory and cybernetics [9,10]. The
work of [9] generalises the study of parametric lenses into parametric optics and
connects it to game-theoretic concepts such as Nash equilibria.


7 Conclusions and Future Directions 


We have given a categorical foundation of gradient-based learning algorithms 
which achieves a number of important goals. The foundation is principled and 
mathematically clean, based on the fundamental idea of a parametric lens. The 
foundation covers a wide variety of examples: different optimisers and loss maps
in gradient-based learning, different settings where gradient-based learning
happens (smooth functions vs. boolean circuits), and both learning of parameters
and learning of inputs (deep dreaming). Finally, the foundation is more than
a mere abstraction: we have also shown how it can be used to give a practical
implementation of learning, as discussed in Section 5.

There are a number of important directions which are possible to explore 
because of this work. One of the most exciting ones is the extension to more 
complex neural network architectures. Our formulation of the loss map as a 
parametric lens should pave the way for Generative Adversarial Networks [27],
an exciting new architecture whose loss map can be said to be learned in tandem
with the base network. In all our settings we have fixed an optimiser beforehand.
The work of [4] describes a meta-learning approach which sees the optimiser as a
neural network whose parameters and gradient update rule can be learned. This
is an exciting prospect since one can model optimisers as parametric lenses,
and our framework covers learning with parametric lenses. Recurrent neural
networks are another example of a more complex architecture, which has already
been studied in the context of differential categories in [46]. When it comes to
architectures, future work includes modelling some classical systems as well, such
as Support Vector Machines [15], which should be possible with the use
of loss maps such as hinge loss.

Future work also includes using the full power of CRDC axioms. In particular, 
axioms RD.6 or RD.7, which deal with the behaviour of higher-order derivatives, 
were not exploited in our work, but they should play a role in modelling some 
supervised learning algorithms using higher-order derivatives (for example, the 
Hessian) for additional optimisations. Taking this idea in a different direction, 
one can see that much of our work can be applied to any functor of the form 
$F : \mathcal{C} \to \mathrm{Lens}(\mathcal{C})$; $F$ does not necessarily have to be of the form $f \mapsto (f, R[f])$
for a CRDC $R$. Moreover, by working with more generalised forms of the lens
category (such as dependent lenses), we may be able to capture ideas related
to supervised learning on manifolds. And, of course, we can vary the parameter
space to endow it with different structure from the functions we wish to learn. In
this vein, we wish to use fibrations/dependent types to model the use of tangent
bundles: this would foster the extension of the correct-by-construction paradigm
to machine learning, thereby addressing the widely acknowledged problem
of trusted machine learning. These possibilities are made much easier by the
compositional nature of our framework. Another key topic for future work is to link
gradient-based learning with game theory. At a high level, the former takes
little incremental steps to achieve an equilibrium while the latter aims to do so in
one fell swoop. Formalising this intuition is possible with our lens-based
framework and the lens-based framework for game theory [25]. Finally, because our
framework is quite general, in future work we plan to consider further
modifications and additions to encompass non-supervised, probabilistic and
non-gradient-based learning. This includes genetic algorithms and reinforcement learning.


1 Introduction


Probabilistic programming languages (PPLs) allow for encoding a wide range of 
statistical inference problems and provide inference algorithms as part of their 
implementations. Specifically, PPLs allow language users to focus solely on en- 
coding their statistical problems, which the language implementation then solves 
automatically. Many such languages exist and are applied in, e.g., statistics, ma- 
chine learning, and artificial intelligence. Some example PPLs are WebPPL [20], 
Birch [32], Anglican [40], and Pyro [10]. 

However, implementing efficient PPL inference algorithms is challenging for 
many real-world problems. Most often, universal PPLs implement general- 
purpose inference algorithms—most commonly sequential Monte Carlo (SMC) 
methods [14], Markov chain Monte Carlo (MCMC) methods [18], Hamiltonian 
Monte Carlo (HMC) methods [12], variational inference (VI) [39], or a combina- 
tion of these. In some cases, poor efficiency may be due to an inference algorithm 
not well suited to the particular PPL program. However, in other cases, the PPL 
implementations do not fully exploit opportunities for parallelization and opti- 
mization on the available hardware. Unfortunately, doing this is often tricky 
without introducing complexity for end-users of PPLs. 

A critical performance consideration is handling probabilistic checkpoints [37] 
in PPLs. Checkpoints are locations in probabilistic programs where inference al- 
gorithms must interject, for example, to resample in SMC inference or record 
random draw locations where MCMC inference can explore alternative execution 
paths. The most common approach to checkpoints—used in universal PPLs such 
as WebPPL [20], Anglican [40], and Birch [32]—is to associate them with PPL- 
specific language constructs. In general, PPL users can place these constructs 
without restriction, and inference algorithms interject through continuation- 
passing style (CPS) transformations [9,20,40] or non-preemptive multitasking 
[32] (e.g., coroutines) that enable pausing and resuming executions. These so- 
lutions are often not available in languages such as C and CUDA [1] used for 
high-performance platforms such as graphics processing units (GPUs), making 
compiling PPLs to these languages and platforms challenging. Some approaches 
for running PPLs on GPUs do exist, however. LibBi [29] runs on GPUs with 
SMC inference but is not universal. Stan [12] and AugurV2 [22] partially run 
MCMC inference on GPUs but have limited expressive power. Pyro [10] runs on 
GPUs, but currently not in combination with SMC. In this paper, we compile a 
universal PPL and run it with SMC on GPUs for the first time. 

A more straightforward approach to checkpoints, used for SMC in Birch [32]
and Pyro [10], is to encode models with a step function called iteratively.
Checkpoints then occur each time step returns. This paper presents a new approach to
checkpoint handling, generalizing the step function approach. We write
probabilistic programs as a set of code blocks connected in what we term a PPL
control-flow graph (PCFG). PPL checkpoints are restricted to only occur at
tail position in these blocks, and communication between blocks is only allowed
through an explicit PCFG state. As a result, pausing and resuming executions
is straightforward: it is simply a matter of stopping after executing a block and
then resuming by running the next block. A variable in the PCFG state, set from
within the blocks, determines the next block. This variable allows for loops and
branching and gives the same expressive power as other universal PPLs. We
implement the above approach in RootPPL: a low-level universal PPL framework
built using C++ and CUDA with highly efficient and parallel SMC inference.
RootPPL consists of both an inference engine and a simple macro-based PPL.


6 A term due to Goodman et al. [19]. No precise definition exists, but in principle, a
universal PPL program can perform probabilistic operations at any point. In
particular, it is not always possible to statically determine the number of random variables.


Fig. 1: The CorePPL and RootPPL toolchain. Solid rectangular components
(gray) represent programs and rounded components (blue) translations. The
dashed rectangles indicate paper sections.

A problem with RootPPL is that it is low-level and, therefore, challenging 
to write programs in. In particular, sending data between blocks through the 
PCFG state can quickly get difficult for more complex models. To solve this, we 
develop a general technique for compiling high-level universal PPLs to PCFGs. 
The key idea is to decompose functions in the high-level language to a set of 
PCFG blocks, such that checkpoints in the original function always occur at 
tail position in blocks. As a result of the decomposition, the PCFG state must 
store a part of the call stack. The compiler adds code for handling this call 
stack explicitly in the PCFG blocks. We illustrate the compilation technique by 
introducing a high-level source language, Miking CorePPL, and compiling it to 
RootPPL. Fig. 1 illustrates the overall toolchain. 

In summary, we make the following contributions. 


— We introduce PCFGs, a framework for checkpoint handling in PPLs, and use 
it to implement RootPPL: a low-level universal PPL with highly efficient and 
parallel SMC inference (Section 3). 

— We develop an approach for compiling high-level universal PPLs to PCFGs 
and use it to compile Miking CorePPL to RootPPL. In particular, we give an 
algorithm for decomposing high-level functions to PCFG blocks (Section 4). 


Furthermore, we introduce Miking CorePPL in Section 2 and evaluate the 
performance of RootPPL and the CorePPL compiler in Section 5 on real-world 
models from phylogenetics and epidemiology, achieving up to 6x speedups over 
the state-of-the-art. An artifact accompanying this paper supports the
evaluation [26]. An extended version of this article is also available [27]. A † symbol in
the text indicates that more information is available in the extended version.


2 Miking CorePPL


This section introduces the Miking CorePPL language, used as a source language 
for the compiler in Section 4. We discuss design considerations (Section 2.1) and 
present the syntax and semantics (Section 2.2). 


2.1 Design Considerations 


Miking CorePPL (or CorePPL for short) is an intermediate representation (IR) 
PPL, similar to IRs used by LLVM [6] and GCC [2]. This allows the reuse 
of CorePPL as a target for domain-specific high-level PPLs and PPL compiler 
back-ends. Consequently, CorePPL needs to be expressive enough to allow easy 
translation from various domain-specific PPLs and simple enough for practical 
use as a shared IR for compilers. Therefore, we base CorePPL on the lambda 
calculus, extended with standard data types and constructs. 

We must also consider which PPL-specific constructs to include. Critically, 
most PPLs include constructs for defining random variables and likelihood up- 
dating [21]. CorePPL includes such constructs, including first-class probability 
distributions, to match the expressive power of existing PPLs. 


2.2 Syntax and Semantics 


We build CorePPL on top of the Miking framework [11]: a meta-language system 
for creating domain-specific and general-purpose languages. This allows reusing 
many existing Miking language components and transformations when building 
the CorePPL language. More precisely, CorePPL extends Miking Core—a core 
functional programming language in Miking—with PPL constructs. 

A CorePPL program t is inductively defined by 


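$$
\begin{aligned}
t ::={} & x \mid \texttt{lam}\ x.\ t \mid t_1\ t_2 \mid \texttt{let}\ x = t_1\ \texttt{in}\ t_2 \mid C\ t \mid c \\
\mid{} & \texttt{recursive}\ [\texttt{let}\ x = t]\ \texttt{in}\ t \\
\mid{} & \texttt{match}\ t_1\ \texttt{with}\ p\ \texttt{then}\ t_2\ \texttt{else}\ t_3 \mid [t_1, t_2, \ldots, t_n] \\
\mid{} & \{l_1 = t_1, l_2 = t_2, \ldots, l_n = t_n\} \\
\mid{} & \texttt{assume}\ t \mid \texttt{weight}\ t \mid \texttt{observe}\ t_1\ t_2 \mid D\ t_1\ t_2\ \ldots\ t_{|D|}
\end{aligned}
\tag{1}
$$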


where the metavariable x ranges over a set of variable names; C over a set of data 
constructor names; p over a set of patterns; l over a set of record labels; and c over 
various literals, such as integers, floating-point numbers, booleans, and strings, as 
well as over various built-in functions in prefix form such as addi (adds integers). 
The notation [let x = t] indicates a sequence of mutually recursive let bindings. 
The metavariable D ranges over a set of probability distribution names, with |D|
indicating the number of parameters for a distribution D. For example, for the
normal distribution, $|\mathcal{N}| = 2$. In addition to (1), we will also use the standard
syntactic sugar ; to indicate sequencing, as well as if $t_1$ then $t_2$ else $t_3$ for
match $t_1$ with true then $t_2$ else $t_3$.


Fig. 2: A toy example encoding a skewed geometric distribution, illustrating
CorePPL. Part (a) gives the CorePPL program, and part (b) the corresponding 
distribution. The upper part of (b) shows the distribution for (a) with line 4 
omitted, and the lower part of (b) shows it with line 4 included. 


Consider the simple but illustrative CorePPL program in Fig. 2a. The pro- 
gram encodes a variation of the geometric distribution, for which the result is the 
number of times a coin is flipped until the result is tails. The program’s core is 
the recursive function geometric, defined using a function over the probability 
of heads for the coin, p. We initially call this function at line 7 with the argument 
0.5, indicating a fair coin. On line 2, we define the random variable x to have a 
Bernoulli distribution (i.e., a single coin flip) using the assume construct (often 
known as sample in PPLs with sampling-based inference). If the random variable 
is false (tails), we stop and return the result 1. If the random variable is true 
(heads), we keep flipping the coin by a recursive call to geometric and add 1 to 
this result. To illustrate likelihood updating, we make a contrived modification 
to the standard geometric distribution by adding weight (log 1.5) on line 4. 
This construct weights the execution by a factor of 1.5 each time the result is 
heads. Note that CorePPL weight computations are in log-space for numerical 
stability (hence the log 1.5 to factor by 1.5). Thus, the unnormalized
probability of seeing $n$ coin flips, including the final tails, is $0.5^n \cdot 1.5^{n-1}$, where $1.5^{n-1}$ is
the factor introduced by the $n-1$ calls to weight. The difference compared to the
standard geometric distribution is illustrated in Fig. 2b. The weight construct 
is also commonly named factor or score in other PPLs. 
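
The effect of the weight construct can be checked numerically. Since $\sum_{n \ge 1} 0.5^n \cdot 1.5^{n-1} = 2$, the normalized distribution is $0.25 \cdot 0.75^{n-1}$, that is, a geometric distribution with success probability 0.25. The Python sketch below is a reconstruction of the program's behaviour from the description above, not the CorePPL code of Fig. 2a; the function names and the use of plain likelihood weighting are assumptions made for illustration.

```python
import math
import random


def geometric(p, rng):
    """One run of the described program: returns (number of flips, log-weight)."""
    n, log_w = 1, 0.0
    while rng.random() < p:       # assume x ~ Bernoulli(p); True means heads
        log_w += math.log(1.5)    # weight (log 1.5), applied once per heads
        n += 1
    return n, log_w


def estimate(num_samples, seed=0):
    """Likelihood-weighted estimate of the normalized distribution over n."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(num_samples):
        n, log_w = geometric(0.5, rng)
        totals[n] = totals.get(n, 0.0) + math.exp(log_w)
    z = sum(totals.values())
    return {n: w / z for n, w in sorted(totals.items())}


def exact(n):
    """Normalized probability 0.5^n * 1.5^(n-1) / 2 of seeing n flips."""
    return 0.5 ** n * 1.5 ** (n - 1) / 2.0


posterior = estimate(200_000)
```

Comparing `posterior[n]` with `exact(n)` for small n shows the shift of probability mass towards longer runs that the weighting introduces.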

What separates PPLs from ordinary programming languages is the ability to 
modify the likelihood of execution paths, akin to the use of weight in Fig. 2a. We 
often use likelihood modification to condition a probabilistic model on observed 
data. For this purpose, CorePPL includes an explicit observe construct, which 
allows for modifying the likelihood based on observed data assumed to originate 
from a given probability distribution. For instance, observe 0.3 (Normal 0 1)
updates the likelihood with $f_{\mathcal{N}(0,1)}(0.3)$ (note that this can equivalently be
expressed through weight), where $f_{\mathcal{N}(0,1)}$ is the probability density function of
the standard normal distribution. This conditioning can be related to Bayes’ 
theorem: the random variables defined in a program define a prior distribution 
(e.g., the upper part of Fig. 2b), the use of the weight and observe primitives a
likelihood function, and the inference algorithm of the PPL infers the posterior
distribution (e.g., the lower part of Fig. 2b).
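
The remark that observe can be expressed through weight can be illustrated with a small Python sketch. The `Trace` class and its method names are hypothetical, and the second parameter of the normal distribution is treated as a standard deviation, which makes no difference for the standard normal used here.

```python
import math


def normal_logpdf(x, mean, std):
    """Log-density of a normal distribution at x; std is a standard deviation."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


class Trace:
    """Hypothetical holder for the accumulated log-likelihood of one execution."""

    def __init__(self):
        self.log_weight = 0.0

    def weight(self, log_w):
        # Weights are accumulated in log-space for numerical stability.
        self.log_weight += log_w

    def observe(self, value, logpdf):
        # observe reduces to weight applied to the log-density of the observation.
        self.weight(logpdf(value))


trace = Trace()
trace.observe(0.3, lambda v: normal_logpdf(v, 0.0, 1.0))
likelihood = math.exp(trace.log_weight)  # density of N(0, 1) at 0.3
```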

CorePPL includes sequences, recursive variants, records, and pattern match- 
ing, standard in functional languages. For example, [1, 2, 3] defines a se- 
quence of length 3, {a = false, b = 1.2} a record with labels a and b, and 
Leaf {age = 1.0} a variant with the constructor name Leaf, containing a 
record with the label age. The match construct allows pattern matching. For
example, match a with Leaf {age = f} then f else 0.0 checks if a is a Leaf
and returns its age if so, or 0.0 otherwise. Here, f is a pattern variable that is 
bound to the value of the age element of a in the then branch of the match. 

The data types and pattern matching features in Miking, and consequently 
CorePPL, are not directly related to the paper’s key contributions. Therefore, 
we do not discuss them further. However, the CorePPL compiler in Section 4.3 
supports the features, and the CorePPL models in Section 5 make frequent use 
of them. We consider CorePPL again in Section 4 when compiling to PCFGs. 


3 PPL Control-Flow Graphs and RootPPL


This section introduces the new PCFG concept (Section 3.1) and shows how to 
apply SMC over these (Section 3.2). Finally, we present the PCFG and SMC- 
based RootPPL framework (Section 3.3). 


3.1 PPL Control-Flow Graphs 


In order to handle checkpoints efficiently without CPS or non-preemptive mul- 
titasking, we introduce PPL control-flow graphs (PCFGs). In contrast to tra- 
ditional PPLs, where checkpoints are most often implicit, we make them ex- 
plicit and central in the PCFG framework. The main benefit of this approach 
is that the handling of checkpoints in inference algorithms is greatly simplified, 
which allows for implementing the framework in low-level languages. However, 
the explicit checkpoint approach makes PCFGs relatively low-level, and they are 
mainly intended as a target when compiling from high-level PPLs. We introduce 
such a compiler in Section 4. 

Formally, we define a PCFG as a 6-tuple $(B, S, \mathit{sim}, b_0, b_{\mathrm{stop}}, \mathcal{L})$. The first
component $B$ is a set of basic blocks inspired by basic blocks used as a part
of the control-flow analysis in traditional compilers [8]. In practice, the blocks
in $B$ are pieces of code that together make up a complete probabilistic
program. Unlike basic blocks used in traditional compilers, we allow these pieces of
code to contain branches internally. The second component $S$ is a set of states,
representing collections of information that flow between basic blocks. In
practice, this state often contains local variables that live between blocks and an
accumulated likelihood. The blocks and states form the domain of the function
$\mathit{sim} : B \times S \to B \times S \times \{\mathrm{false}, \mathrm{true}\}$. This function performs computation specific
to the given block over the given state and outputs a successor block indicating
what to execute next, an updated state, and a boolean indicating whether or
not there is a checkpoint at the end of the executed block.


Fig. 3: A PCFG illustration. Part (a) shows an example PCFG. The arrows
denote the possible flows of control between the blocks, with regular arrows
denoting checkpoint transitions and arrows with open tips non-checkpoint transitions.
Part (b) shows a possible execution sequence with sim for (a).


Algorithm 1 A standard SMC algorithm applied to PCFGs.


Input: A PCFG $(B, S, \mathit{sim}, b_0, b_{\mathrm{stop}}, \mathcal{L})$. A set of initial states $\{s_n\}_{n=1}^N$.
Output: An updated set of states $\{s_n\}_{n=1}^N$.


1. Initialization: For each $1 \leq n \leq N$, let $a_n := b_0$ and $c_n := \mathrm{false}$.

2. Propagation: If all $a_n = b_{\mathrm{stop}}$, terminate and output $\{s_n\}_{n=1}^N$. If not, for each
$1 \leq n \leq N$ where $c_n = \mathrm{false}$, let $(a_n, s_n, c_n) := \mathit{sim}(a_n, s_n)$. If all $c_n = \mathrm{true}$, go
to 3. If not, repeat 2.

3. Resampling: For each $1 \leq n \leq N$, let $p_n := \mathcal{L}(s_n) / \sum_{i=1}^N \mathcal{L}(s_i)$. For each
$1 \leq n \leq N$, draw a new index $i$ from $\{i\}_{i=1}^N$ with probabilities $\{p_i\}_{i=1}^N$. Let
$(s'_n, a'_n) := (s_i, a_i)$. Finally, for each $1 \leq n \leq N$, let $(s_n, a_n, c_n) := (s'_n, a'_n, \mathrm{false})$. Go to 2.

To illustrate this formalization, consider the PCFG in Fig. 3a for which
$B = \{b_0, b_1, \ldots, b_4, b_{\mathrm{stop}}\}$. The block $b_0$ is present in every PCFG and represents
its entry point. Similarly, the block $b_{\mathrm{stop}}$ is a unique block indicating termination,
which must be reachable from all other blocks. For some initial state $s_0 \in S$,
Fig. 3b illustrates a possible execution sequence starting at $b_0$ in Fig. 3a before
terminating at $b_{\mathrm{stop}}$. The structure of a PCFG restricts checkpoints to only occur
at the end of basic blocks and confines communication between blocks to the 
state. These restrictions greatly simplify inference algorithm implementations. 
More precisely, rather than relying on CPS or non-preemptive multitasking, the 
inference algorithm can simply run a block b with sim, handle the checkpoint, 
and then run the successor block indicated by the output of sim. 
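
The run-a-block, handle-the-checkpoint loop described above can be sketched in a few lines of Python. This is a toy illustration and not RootPPL code: the block names, the dictionary-based state, and the `run` driver are hypothetical, and blocks are represented as functions returning the triple (successor, state, checkpoint flag) produced by sim.

```python
import random

STOP = "b_stop"  # plays the role of the termination block


def block_init(state):
    # Non-checkpoint transition: set up the state and continue directly.
    return "flip", dict(state, flips=0), False


def block_flip(state):
    # Checkpoint transition: both branches end the block at a checkpoint.
    heads = state["rng"].random() < 0.5
    successor = "flip" if heads else STOP
    return successor, dict(state, flips=state["flips"] + 1), True


BLOCKS = {"init": block_init, "flip": block_flip}


def sim(block, state):
    """Run one block; return (successor block, updated state, checkpoint flag)."""
    if block == STOP:
        return STOP, state, True
    return BLOCKS[block](state)


def run(entry, state, handle_checkpoint):
    """Drive a single execution: run a block, handle any checkpoint, continue."""
    block = entry
    while block != STOP:
        block, state, checkpoint = sim(block, state)
        if checkpoint:
            handle_checkpoint(block, state)  # e.g. an inference algorithm's hook
    return state


seen = []
final = run("init", {"rng": random.Random(0)}, lambda b, s: seen.append(s["flips"]))
```

No continuation or coroutine is needed to pause the execution: the pair (block, state) is all that has to be stored between checkpoints.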


3.2 SMC and PCFGs 


To prepare for introducing RootPPL in Section 3.3, we present how to apply 
SMC inference to PCFGs. The work by Naesseth et al. [33] contains a more 
general and pedagogical introduction to SMC. At a high level, SMC inference 
works by simulating many instances (known as particles in SMC literature) of
a PCFG program concurrently, occasionally resampling the different particles
based on their current likelihoods. In CorePPL, for example, such likelihoods 
are determined by weight and observe. Resampling allows the downstream 
simulation to focus on particles with a higher likelihood. 

In order to apply SMC inference over PCFGs, we need some way of deter- 
mining the likelihood of the SMC particles. For this, we use the final component 
of the PCFG definition, $\mathcal{L} : S \to \mathbb{R}_{\geq 0}$, which is a function mapping states to a
likelihood (a non-negative real number). Concretely, this likelihood is most often
stored directly in the state as a real number, and $\mathcal{L}$ simply extracts it.

Algorithm 1 defines an SMC algorithm over PCFGs. It takes a PCFG as
input, together with a set of $N$ states $\{s_n\}_{n=1}^N$, which represent the SMC
particles. Step 1 in the algorithm sets up variables $a_n$ and $c_n$, indicating for each
particle its current block and whether or not a checkpoint has occurred in it.
Step 2 simulates all particles that have not yet reached a checkpoint using sim.
This step repeats until all particles have reached a checkpoint (this is a
synchronization point for parallel implementations). Step 3 uses the likelihood function
$\mathcal{L}$ to compute the relative likelihoods of all particles and then resamples them
based on this. That is, we sample $N$ particles from the existing $N$ particles (with
replacement) based on the relative likelihoods. After resampling, we return to
step 2. If all particles have reached the termination block $b_{\mathrm{stop}}$, the algorithm
terminates and returns the current states.

Note in Algorithm 1 that the input states are not required to be identical. For 
example, each state should have a unique seed used to generate random num- 
bers (e.g., with assume in CorePPL). Non-identical initial states in Algorithm 1 
imply that different particles may traverse the blocks in B differently and reach 
checkpoints at different times. Although this means that different particles can 
be at different blocks concurrently, the SMC algorithm is still correct [24]. This 
PCFG property is essential as it allows for the encoding of universal
probabilistic programs in PCFG-based PPLs. Furthermore, it implies that some particles
may reach $b_{\mathrm{stop}}$ earlier than others. To handle this, we require in Algorithm 1 that
$\mathit{sim}(b_{\mathrm{stop}}, s) = (b_{\mathrm{stop}}, s, \mathrm{true})$ holds for all states $s$. That is, particles that have
finished also participate in resampling and cannot cause step 2 to loop infinitely.
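
The three steps of Algorithm 1 can be sketched sequentially in Python as follows. This is a hedged illustration rather than the RootPPL engine. Two assumptions are made explicit: the likelihood stored in a state is taken to be the weight accumulated since the previous resampling, so the hypothetical `reset` argument clears it after step 3, and a single shared random number generator replaces per-particle seeds. The toy PCFG is a weighted coin-flip program with one block.

```python
import math
import random

STOP = "b_stop"


def smc(sim, likelihood, reset, states, entry, stop, rng):
    """Sequential sketch of Algorithm 1; returns the final list of states."""
    states = list(states)
    n = len(states)
    blocks = [entry] * n             # a_n: current block of each particle
    at_checkpoint = [False] * n      # c_n: has the particle hit a checkpoint?
    while True:
        # Step 2 (propagation).
        if all(b == stop for b in blocks):
            return states
        for i in range(n):
            if not at_checkpoint[i]:
                blocks[i], states[i], at_checkpoint[i] = sim(blocks[i], states[i])
        if not all(at_checkpoint):
            continue
        # Step 3 (resampling), proportional to the particle likelihoods.
        weights = [likelihood(s) for s in states]
        picks = rng.choices(range(n), weights=weights, k=n)
        states = [reset(states[i]) for i in picks]   # assumption: weights are reset
        blocks = [blocks[i] for i in picks]
        at_checkpoint = [False] * n


def make_sim(rng):
    # Toy PCFG: a state is the immutable pair (number of heads, log-weight).
    def sim(block, state):
        heads, log_w = state
        if block == STOP:
            return STOP, state, True             # finished particles stay put
        if rng.random() < 0.5:                   # heads: reweight and flip again
            return "flip", (heads + 1, log_w + math.log(1.5)), True
        return STOP, state, True                 # tails: checkpoint, then stop
    return sim


rng = random.Random(0)
particles = smc(
    make_sim(rng),
    likelihood=lambda s: math.exp(s[1]),
    reset=lambda s: (s[0], 0.0),
    states=[(0, 0.0)] * 20000,
    entry="flip",
    stop=STOP,
    rng=rng,
)
fraction_no_heads = sum(1 for heads, _ in particles if heads == 0) / len(particles)
```

For this toy program the exact normalized probability of zero heads is 0.25, so `fraction_no_heads` should land near that value. Particles that stop early keep taking part in resampling, as required above.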

Next, we describe our implementation of PCFGs with SMC: RootPPL. 


3.3 RootPPL 


We make use of the PCFG framework when implementing RootPPL: a new 
low-level PPL framework built on top of CUDA C++ and C++, intended 
for highly optimized and massively parallel SMC inference on general-purpose 
GPUs. RootPPL consists of two major components: a macro-based C++ PPL 
for encoding probabilistic models and an SMC inference engine. 

The macro-based language has two purposes: to support compiling the same 
program to either CPU or GPU and to simplify the encoding of models for 
programmers. As a result, the macros hide all hardware details from the pro- 
grammer. To illustrate this macro-based PPL, consider the example RootPPL 
     1| BBLOCK(init, progState_t, {
     2|   PSTATE.x = SAMPLE(normal, 0.0, 100);
     3|   PSTATE.t = 0;
     4|   NEXT = iter;
     5|   BBLOCK_JUMP(iter);
     6| })
     7|
     8| BBLOCK(iter, progState_t, {
     9|   PSTATE.x = SAMPLE(normal, PSTATE.x + 2.0, 1);
    10|   OBSERVE(normal, PSTATE.x, 5.0, data[PSTATE.t]);
    11|   if (++PSTATE.t == T) NEXT = NULL;
    12| })

(a) RootPPL program 


    struct progState_t {
      double x;
      int t;
    };

(b) Program state 


Fig. 4: Part (a) illustrates a RootPPL program encoding the state-space model 
in (2). The text provides details. We set NEXT at line 4 rather than in iter as an 
optimization. Part (b) defines the RootPPL program state type progState_t. 


program in Fig. 4a. This program encodes a simple state-space model for an 
object moving along an axis in R, given by 


$$X_0 \sim \mathcal{N}(0, 100), \quad X_t \sim \mathcal{N}(X_{t-1} + 2, 1), \quad Y_t \sim \mathcal{N}(X_t, 5), \quad 1 \le t \le T. \qquad (2)$$


Here, $X_0$ is the initial position, $X_t$ the following positions, and $Y_t$ a set of noisy 
observations of the object position. The inference goal is to determine the 
distribution of $X_T$ (the final position of the object) conditioned on all $Y_t$. 

Fig. 4a implements (2) with two basic blocks, introduced with the BBLOCK 
macro in RootPPL. The first block init draws $X_0$ using the SAMPLE macro 
(equivalent to assume in CorePPL) on line 2 and stores the drawn value in the 
program state variable x through the PSTATE macro. This program state is the 
RootPPL instantiation of the PCFG state introduced in Section 3.1. Another 
program state variable, t (corresponding to the index t in the model), is ini- 
tialized on line 3. As preparation for iterating over the iter block, we set the 
NEXT construct to iter at line 4. Finally, the block exits by making a direct 
non-checkpoint transition to iter using the BBLOCK_JUMP macro at line 5. 

In iter, we sample $X_t$ at line 9 and write the result to x (overwriting the 
previous value $X_{t-1}$, which is no longer needed). Line 10 updates the likelihood 
using the OBSERVE macro (equivalent to observe in CorePPL), corresponding to 
observing $Y_t$ in the model. We access all $Y_t$ through the data array, a shared 
global constant, avoiding memory duplication in the program state. Finally, at 
line 11, we check if we are at time $T$ (T is also a shared global constant). If this 
is the case, NEXT is set to NULL, indicating termination. This is equivalent to 
moving to $b_{\text{stop}}$ in the PCFG formalization. Otherwise, NEXT keeps the value set 
at line 4, and execution resumes at the beginning of the iter block. Not using 
BBLOCK_JUMP allows iter to return to the inference engine between iterations, 
indicating checkpoint transitions. In RootPPL, this means that SMC inference 
will resample the instances before returning to iter for the next iteration. 

The programmer defines the RootPPL program state for each RootPPL pro- 
gram as an arbitrary C++ struct type and passes this type (e.g., progState_t 
in Fig. 4a) to each basic block. The PSTATE macro accesses the variables in the 
struct. Fig. 4b illustrates the program state for the example program in Fig. 4a. 
As described in Section 3.1, this program state is the only possible means to 
pass data from one basic block to another in RootPPL. 

This minimal example does not illustrate all RootPPL language features (e.g., 
WEIGHT). Further details on the RootPPL language are available at GitHub [4]. 

The second part of the RootPPL framework is the SMC inference engine. 
To achieve high performance, it is crucial to exploit both the highly parallel 
nature of SMC and the available parallel hardware. For this purpose, RootPPL 
supports three compile targets: single-core C++, multicore C++ through 
OpenMP [3], and CUDA C++ [1] with massive parallelism on the GPU. 

We present the main inference loop in RootPPL below (cf. Algorithm 1). 


1. Initialize random seeds. 

2. Execute the basic block indicated by NEXT for all particles. This execution 
may include a chain of blocks with non-checkpoint transitions between them 
(using the BBLOCK_JUMP macro) before returning to the inference engine. 

3. If all particles have terminated (i.e., NEXT = NULL), stop. 

4. Resample all particles and go to 2. 


The random seeds in step 1 are initialized differently depending on the compile 
target. For plain C++ on a single core, one seed is shared between all particles 
because they are executed sequentially. However, for OpenMP and CUDA, the 
parallel execution requires that we assign each thread a unique seed shared 
between all particles running on it. For CUDA, these seeds are placed in thread- 
local CUDA memory for each particle to minimize memory overhead when using 
SAMPLE (which is performance-critical). In addition, when compiling to CUDA, 
we initialize the seeds in parallel using a CUDA compute kernel. 

Step 2 executes the particles sequentially, in parallel using OpenMP threads, 
or in parallel using a CUDA compute kernel. Step 3 then performs a termi- 
nation check. First, we check if the first particle has terminated. If it has not 
terminated, we directly move to the resampling step. If it has terminated, we it- 
eratively check other particles to either find a particle that has not terminated or 
conclude that all particles have terminated and stop the inference. This approach 
both allows for particles terminating at different times and introduces minimal 
overhead for the case when all particles terminate simultaneously (which is quite 
common). When all particles terminate simultaneously, it is enough to check the 
first particle in all iterations of step 3 except the last. 
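
The termination check described above can be sketched as follows. This is a hedged illustration only: the function name allTerminated and the encoding of NEXT = NULL as the integer constant kTerminated are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical encoding: kTerminated plays the role of NEXT = NULL.
constexpr int kTerminated = -1;

// Returns true only if every particle has terminated.
bool allTerminated(const std::vector<int>& next) {
    if (next.empty()) return true;
    // Fast path: when particles terminate simultaneously, inspecting the
    // first particle settles every iteration except the last one.
    if (next[0] != kTerminated) return false;
    // Slow path: the first particle is done, so scan for a live particle.
    for (std::size_t i = 1; i < next.size(); ++i) {
        if (next[i] != kTerminated) return false;
    }
    return true;
}
```

The full scan only happens once the first particle has terminated, so the common case costs a single comparison per iteration.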

The resampling step is the most difficult one to parallelize efficiently. The 
reason is the normalizing sum (e.g., $\sum_{i=1}^{N} L(s_i)$ in Algorithm 1) that we must 
compute in order to determine resampling probabilities. We use systematic re- 
sampling for single-core and OpenMP and parallel systematic resampling for 
CUDA, as described in Murray et al. [31] (we do not use in-place propagation). 
We compute the normalizing sum in parallel via the Thrust library [7] for CUDA. 
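
For orientation, sequential systematic resampling can be sketched as below. This is a textbook-style sketch, not RootPPL's code, and it does not show the parallel variant of Murray et al. [31]. The function name systematicResample is hypothetical, and the uniform draw u is passed in explicitly (an assumption) so that the behavior is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, for each of the n output slots, the index of the particle copied
// into it. u must be a draw from the uniform distribution on [0, 1).
std::vector<std::size_t> systematicResample(const std::vector<double>& weights,
                                            double u) {
    assert(u >= 0.0 && u < 1.0);
    const std::size_t n = weights.size();
    double total = 0.0;
    for (double w : weights) total += w;  // the normalizing sum
    std::vector<std::size_t> ancestors(n);
    double cumulative = 0.0;
    std::size_t j = 0;  // next output slot to fill
    for (std::size_t i = 0; i < n; ++i) {
        cumulative += weights[i];
        // Particle i receives one offspring for every evenly spaced grid
        // point (u + j) * total / n that falls below its cumulative weight.
        while (j < n && (u + j) * total / n < cumulative) {
            ancestors[j++] = i;
        }
    }
    // Guard against floating-point rounding in the last grid points.
    while (j < n) ancestors[j++] = n - 1;
    return ancestors;
}
```

The loop needs the normalizing sum before it can place any grid point, which is why this sum is the part that must be computed in parallel on the GPU.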

Another important consideration for the inference engine is memory 
allocation. In particular, the memory allocated for NEXT, the likelihood, and the 
PSTATE for each particle is laid out as separate arrays in memory, rather than 
one big array of structs. This struct-of-arrays layout enables memory coalescing, 
avoids strided memory accesses in global memory, and is preferred for parallel 
operations, particularly for CUDA. Another memory consideration is particle 
duplication during resampling. For this, we use a custom aligned memory transfer 
in CUDA because the standard memcpy implementation in CUDA proved to be 
a bottleneck. With a single core and OpenMP, memcpy runs without issue. 
Additionally, we perform a specific optimization when copying the program state 
used in the CorePPL compiler. This program state consists of a possibly large 
stack (with user-definable size) together with a stack pointer, and we ensure not 
to copy the unused part of the stack located beyond the stack pointer. This is a 
critical optimization for the CorePPL compiler. 
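
The two memory ideas above (separate arrays per field, and copying only the used part of the stack) can be sketched on the CPU as follows. All names here (Particles, ProgState, copyParticle, kStackSize) are hypothetical, and plain std::memcpy stands in for the custom aligned CUDA transfer, which is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kStackSize = 1024;  // hypothetical user-definable size

// Program state as used by compiled programs: a byte stack and a stack pointer.
struct ProgState {
    unsigned char stack[kStackSize] = {};
    std::size_t stackPtr = 0;  // offset of the first unused byte
};

// Struct-of-arrays layout: one array per field instead of an array of structs.
struct Particles {
    std::vector<int> next;
    std::vector<double> likelihood;
    std::vector<ProgState> state;
    explicit Particles(std::size_t n) : next(n, 0), likelihood(n, 1.0), state(n) {}
};

// Duplicates particle src into slot dst, as needed during resampling.
void copyParticle(Particles& p, std::size_t dst, std::size_t src) {
    assert(dst < p.next.size() && src < p.next.size());
    if (dst == src) return;
    p.next[dst] = p.next[src];
    p.likelihood[dst] = p.likelihood[src];
    p.state[dst].stackPtr = p.state[src].stackPtr;
    // Copy only the bytes below the stack pointer; the rest is unused.
    std::memcpy(p.state[dst].stack, p.state[src].stack, p.state[src].stackPtr);
}
```

With a large kStackSize and shallow call stacks, the copied byte count is governed by stackPtr rather than by the stack capacity.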

RootPPL also supports estimating normalizing constants for encoded models 
and adaptive resampling based on the current effective sample size (ESS). These 
are standard concepts in SMC inference. For more details, see, e.g., Naesseth 
et al. [33]. 
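
As a reminder of the standard ESS estimate, $(\sum_i w_i)^2 / \sum_i w_i^2$, here is a small sketch. The function names and the threshold parameter fraction are assumptions for illustration; the source does not state which threshold RootPPL uses.

```cpp
#include <cassert>
#include <vector>

// Standard ESS estimate from unnormalized weights: (sum w)^2 / sum w^2.
double effectiveSampleSize(const std::vector<double>& weights) {
    double sum = 0.0;
    double sumSq = 0.0;
    for (double w : weights) {
        sum += w;
        sumSq += w * w;
    }
    return sumSq > 0.0 ? (sum * sum) / sumSq : 0.0;
}

// Adaptive resampling: resample only when the ESS drops below a fraction
// of the particle count (the fraction is a hypothetical tuning parameter).
bool shouldResample(const std::vector<double>& weights, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    return effectiveSampleSize(weights) < fraction * weights.size();
}
```

Equal weights give an ESS equal to the particle count, while a single dominant weight gives an ESS close to 1.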

Next, we use RootPPL as the target language for the CorePPL compiler. 


4 Compiling to PCFGs 


This section introduces the ideas for compiling high-level universal PPLs to 
PCFGs. We present the key transformation, function decomposition into basic 
blocks, using a toy example (Section 4.1) and a formal algorithm (Section 4.2). 
We then give a high-level overview of the CorePPL-to-RootPPL compiler 
(Section 4.3) and discuss the compiler's strengths and limitations (Section 4.4). 


4.1 Function Decomposition Example 


The major challenge when compiling high-level PPLs is implementing pausing 
and resuming at checkpoints to yield control to an inference algorithm temporar- 
ily. Pausing and resuming in low-level languages is especially difficult due to run- 
time limitations. We solve this problem by compiling to the PCFGs introduced in 
Section 3, specifically designed for implementation in low-level target languages. 
A challenge with this approach is that checkpoints can occur at arbitrary loca- 
tions in high-level probabilistic programs, whereas in PCFGs, checkpoints must 
always occur at tail position in basic blocks. We solve this by decomposing func- 
tions in the source language into a set of basic blocks. Our approach is similar 
to how functions are decomposed into basic blocks in standard compilers such 
as GCC [2] and LLVM [6] (see, e.g., Aho et al. [8]). The difference is that we 
only decompose as needed, based on where checkpoints occur. In particular, we 
do not decompose functions, and parts of functions, in which checkpoints are 
guaranteed not to occur. This allows for more optimizations by the underlying 
compiler (e.g., NVCC or GCC for RootPPL). 

Consider the toy CorePPL function in Fig. 5a and the resulting compila- 
tion to a RootPPL PCFG in Fig. 5c. For this example, we introduce an explicit 
SMC checkpoint resample in CorePPL, indicating where SMC should pause 
    recursive let f: Float -> Float =
      lam p.
        let s1 = assume (Gamma p p) in
        resample;
        let s2 =
          if geqf s1 1. then 2.
          else 3. in
        let s3 =
          if leqf s2 4. then
            let s4 =
              if eqf s2 5. then 6.
              else f 7. in
            addf s4 s4
          else 8. in
        mulf s3 s3
    in

(a) Source CorePPL program. 


     1| recursive let f: Float -> Float =
     2|   lam p.
     3|     let s1 = assume (Gamma p p) in
     4|     resample;
     5|     let t1 = geqf s1 1. in
     6|     let s2 = if t1 then 2. else 3. in
     7|     let t2 = leqf s2 4. in
     8|     let s3 =
     9|       if t2 then
    10|         let t3 = eqf s2 5. in
    11|         let s4 =
    12|           if t3 then 6. else f 7. in
    13|         addf s4 s4
    14|       else 8. in
    15|     mulf s3 s3
    16| in

    Blocks in (c): block 1 = lines 3-4, block 2 = lines 5-12 and 14,
    block 3 = line 13, block 4 = line 15.

(b) Intermediate ANF representation. 


    Block 1
     1| struct STACK_f *sf =
     2|   PSTATE.stack
     3|   + PSTATE.stackPtr
     4|   - sizeof(struct STACK_f);
     5| sf->s1 =
     6|   SAMPLE(gamma, sf->p, sf->p);
     7| NEXT = 2;

    Block 2
     1| struct STACK_f *sf = ...;
     2| char t1 = sf->s1 >= 1.;
     3| double s2;
     4| if (t1 == 1) { s2 = 2.; }
     5| else { s2 = 3.; }
     6| char t2 = s2 <= 4.;
     7| if (t2 == 1) {
     8|   char t3 = s2 == 5.;
     9|   if (t3 == 1) {
    10|     sf->s4 = 6.;
    11|     BBLOCK_JUMP(3);
    12|   } else {
    13|     struct STACK_f *callsf =
    14|       PSTATE.stack
    15|       + PSTATE.stackPtr;
    16|     callsf->ra = 3;
    17|     callsf->p = 7.;
    18|     callsf->retValLoc =
    19|       &(sf->s4)
    20|       - PSTATE.stack;
    21|     PSTATE.stackPtr =
    22|       PSTATE.stackPtr
    23|       + sizeof(struct STACK_f);
    24|     BBLOCK_JUMP(1);
    25|   }
    26| } else {
    27|   sf->s3 = 8.;
    28|   BBLOCK_JUMP(4);
    29| }

    Block 3
     1| struct STACK_f *sf = ...;
     2| sf->s3 = sf->s4 + sf->s4;
     3| BBLOCK_JUMP(4);

    Block 4
     1| struct STACK_f *sf = ...;
     2| double t = sf->s3 * sf->s3;
     3| *(PSTATE.stack + sf->retValLoc) = t;
     4| PSTATE.stackPtr =
     5|   PSTATE.stackPtr
     6|   - sizeof(struct STACK_f);
     7| BBLOCK_JUMP(sf->ra);

    Edges: call to f -> 1; 1 -> 2 (checkpoint); 2 -> 3; 2 -> 1 (recursive call);
    2 -> 4; 3 -> 4; 4 -> return from f.

(c) Compiled RootPPL PCFG illustration. Some RootPPL constructs are omitted or 
slightly modified for readability. In particular, we omit the BBLOCK construct used in 
Fig. 4a. Instead, we illustrate the blocks as nodes in a graph, numbered by indices. The 
arrows indicate control flow between the blocks, with the incoming arrow to block 1 
representing the call to f and the outgoing arrow from block 4 representing the return 
from f. 


Fig. 5: Compilation of a CorePPL program (a) to a RootPPL PCFG (c). Part 
(b) illustrates an intermediate ANF representation of (a) and also indicates the 
parts of the program corresponding to the blocks in (c). We provide further 
details in the text. 
executions in order to resample. The resample construct is the sole checkpoint 
considered in this example (and the CorePPL compiler), but the method gener- 
ally applies for arbitrary checkpoints. Optimally, the resample construct should 
be automatically inserted by the compiler [25]. However, we do not consider this 
problem in this paper and assume resamples are inserted prior to compilation. 
The first step in the decomposition is to translate the program into A-normal 
form (ANF) [15], illustrated in Fig. 5b. ANF is commonly used in compilers and 
ensures that non-trivial expressions (e.g., function applications and checkpoints) 
are always name-bound. For CorePPL, ANF guarantees that the body of each 
let expression, or expression in tail position, is trivial, contains at most one 
function application, or is an if expression with a trivial condition, resulting 
in simplified decomposition. We will use the program in Fig. 5b as the target 
for decomposition in the following. Note that variables introduced by ANF start 
with a t in Fig. 5b, while the original variables from Fig. 5a start with an s. 


The goal with the decomposition is to ensure that we immediately return 
control to the inference engine at checkpoints. In the PCFG framework, the only 
way to fulfill this is to ensure that checkpoints occur at tail position in basic 
blocks. First, consider the resample checkpoint at line 4 in Fig. 5b, causing a 
split into blocks 1 and 2 in the compiled RootPPL PCFG in Fig. 5c. Note that in 
block 1, NEXT is set to 2 at line 7 before returning, indicating that the inference 
engine should resume execution at block 2 after handling the checkpoint, also 
illustrated by a closed arrow. Note the stack frame pointer sf in block 1 for 
this invocation of f, which points to a location in an explicit call stack in the 
RootPPL program state PSTATE. We require such a call stack due to compiling 
to PCFGs—any data that lives between basic blocks (e.g., a call stack), such 
as s1, must be put in the program state. We define the stack frame pointer sf 
equivalently at the top of all blocks for the decomposed function f in Fig. 5c but 
replace the definition with ... in blocks other than the first for brevity. 


It is not sufficient to split into blocks at explicit checkpoints. Consider, for 
example, the recursive call to f in the else branch on line 12 in Fig. 5b. During 
this function call, we encounter at least one resample, resulting in at least one 
block split within the function, meaning that all data required by f must be put 
in an explicit stack frame and stored in the program state. If not, we lose the 
data between the basic blocks of f. In particular, the block return address ra is 
stored in the stack frame, indicating which block to return to at the end of the 
function call. In the case of the call to f at line 12 in Fig. 5b, we must return 
to line 13. Therefore, we must place line 13 at the beginning of a basic block in 
Fig. 5c (block 3). In general, we must place all calls to decomposed functions (i.e., 
functions that may, directly or indirectly, encounter a checkpoint) at tail position 
in basic blocks. Besides line 13 in Fig. 5b, this also means that line 15 in Fig. 5b 
cannot be part of block 2. It cannot be part of block 3 either because it may be 
executed independently of line 13 in Fig. 5b if we take the else branch of the 
if at line 9 in Fig. 5b. Consequently, we must put it in a separate block (block 
4 in Fig. 5c). The decomposition of function applications and if expressions is 
similar to how standard compilers decompose machine instructions into basic 
blocks (sequences of instructions without any internal jumps or branches) [8]. 
The difference, however, is that we do not split into blocks at all if expressions 
and function calls. For example, the if at line 6 in Fig. 5b is guaranteed not to 
include a checkpoint and can be left untouched (lines 4-5 in Fig. 5c). Similarly, 
the call to geqf at line 5 in Fig. 5b is guaranteed not to encounter any checkpoints. 
Conservatively determining which functions are guaranteed not to encounter any 
checkpoints can be done through static analysis. Such a static analysis phase is 
part of the CorePPL compiler, described in Section 4.3. 

We now take a closer look at the call stack handling in Fig. 5c. The following 
description is specific to RootPPL, but similar solutions must be applied if 
compiling to other target languages utilizing PCFGs. First, the program state 
PSTATE consists of a byte array stack and a pointer to the top of this stack named 
stackPtr. We increment and decrement this stack pointer when stack frames 
are added and removed, respectively, at function calls and returns. The type 
STACK_f represents the stack frame for the function f (such a stack frame type 
must be determined and set up for each function we decompose) and contains 
its block return address ra, its parameter p (functions with multiple parameters 
have one entry for each parameter), and an address retValLoc at which we write 
its return value. Additionally, it contains the local variables s1, s3, and s4 that 
travel across the blocks in f. Note, however, that local variables used only within 
a single block do not need to go in the stack frame (e.g., t1 and s2), and the 
underlying target language (e.g., CUDA for RootPPL) can instead handle them 
directly. Lines 13-24 in block 2 in Fig. 5c illustrate the recursive call to f at line 
12 in Fig. 5b. Here, we allocate a new complete stack frame callsf and initialize 
ra, p, and retValLoc. Allocating the complete stack frame prior to the function 
call is different from most standard compilers, which most often allocate the part 
of the stack frame containing local variables at the start of the called function. 
This strategy allows for making the allocation size dependent on, e.g., function 
arguments. Here, we instead know all stack frame sizes at compile time. After 
setting up the stack frame, we increment the stack pointer at lines 21-23 and 
pass control to the recursive invocation of f by using BBLOCK_JUMP at line 24. 
Inversely, we illustrate function return in block 4 on lines 3-7. First, we set the 
return value, and second, we decrement the stack pointer. Finally, we retrieve 
the return block from the stack frame and pass control to this block at line 7. 
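
The call and return protocol above can be exercised outside RootPPL with a small driver loop. The sketch below is hypothetical throughout: it decomposes a toy recursive function sum(n) = n + sum(n - 1), with sum(0) = 0, into two blocks instead of the function f from Fig. 5, it uses no RootPPL macros, and it reads and writes frames through std::memcpy rather than pointer casts. The names Frame, ProgState, push, ret, block1, block2, and run are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr int kStop = -1;  // hypothetical "no next block" marker

struct Frame {
    int ra;                 // block to return to
    std::size_t retValLoc;  // result address, relative to the stack start
    long n;                 // parameter
    long tmp;               // local that must survive across blocks
};

struct ProgState {
    unsigned char stack[4096] = {};
    std::size_t stackPtr = 0;  // offset of the first unused byte
};

Frame loadTop(const ProgState& s) {
    Frame f;
    std::memcpy(&f, s.stack + s.stackPtr - sizeof(Frame), sizeof(Frame));
    return f;
}

void push(ProgState& s, const Frame& f) {
    assert(s.stackPtr + sizeof(Frame) <= sizeof(s.stack));
    std::memcpy(s.stack + s.stackPtr, &f, sizeof(Frame));
    s.stackPtr += sizeof(Frame);
}

// Function return: write the value, pop the frame, continue at the return block.
void ret(ProgState& s, const Frame& f, long value, int& next) {
    std::memcpy(s.stack + f.retValLoc, &value, sizeof(long));
    s.stackPtr -= sizeof(Frame);
    next = f.ra;
}

// Block 1: function entry. The recursive call sits at tail position.
void block1(ProgState& s, int& next) {
    Frame f = loadTop(s);
    if (f.n == 0) { ret(s, f, 0, next); return; }
    std::size_t frameStart = s.stackPtr - sizeof(Frame);
    // The callee writes its result into this frame's tmp and returns to block 2.
    push(s, Frame{2, frameStart + offsetof(Frame, tmp), f.n - 1, 0});
    next = 1;
}

// Block 2: code following the call; combines n with the callee's result.
void block2(ProgState& s, int& next) {
    Frame f = loadTop(s);
    ret(s, f, f.n + f.tmp, next);
}

long run(long n) {
    ProgState s;
    s.stackPtr = sizeof(long);       // reserve a result slot at offset 0
    push(s, Frame{kStop, 0, n, 0});  // initial frame returns to "stop"
    int next = 1;
    while (next != kStop) {
        if (next == 1) block1(s, next); else block2(s, next);
    }
    long result;
    std::memcpy(&result, s.stack, sizeof(long));
    return result;
}
```

Here run(3) evaluates to 6. As in Fig. 5c, the caller allocates the complete callee frame, and the value tmp survives between blocks only because it lives in the frame.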


4.2 Function Decomposition Algorithm 


We now turn to a formal description of the decomposition algorithm. To avoid 
going into specifics of the underlying target language, and in particular the call 
stack handling, we take an abstract view of function bodies and regard them as 
lists of statements of the form 


stmt ::= checkpoint | call | if [stmt] [stmt] | other. (3) 


Here, the [stmt] syntax indicates a list of stmts. Thus, the if construct induc- 
tively contains two lists of stmts—one for each branch. 
     1| [
     2|   other,
     3|   checkpoint,
     4|   other,
     5|   if [other] [other],
     6|   other,
     7|   if [
     8|     other,
     9|     if
    10|       [other]
    11|       [call],
    12|     other
    13|   ] [other],
    14|   other
    15| ]

(a) The program from Fig. 5b translated to type [stmt]. 


    Block 1
     1| [
     2|   other,
     3|   checkpoint 2
     4| ]

    Block 2
     1| [
     2|   other,
     3|   if
     4|     [ other ]
     5|     [ other ],
     6|   other,
     7|   if [
     8|     other,
     9|     if [
    10|       other,
    11|       jump 3
    12|     ] [
    13|       call 3
    14|     ]
    15|   ] [
    16|     other,
    17|     jump 4
    18|   ]
    19| ]

    Block 3
     1| [
     2|   other,
     3|   jump 4
     4| ]

    Block 4
     1| [
     2|   other,
     3|   jump return
     4| ]

(b) Decomposition of (a) into [tstmt] basic blocks. 


Fig. 6: Illustrating Algorithm 2 on the example from Fig. 5. 


We illustrate the representation stmt through an example. Consider the pro- 
gram in Fig. 5b and its mapping to stmts in Fig. 6a. Due to ANF, we can view 
the body of f as a sequence of let bindings and operations separated by ;, 
each performing a single operation of some kind (e.g., a checkpoint or a function 
application). We map each such operation to a stmt in Fig. 6a. The resample 
checkpoint at line 4 in Fig. 5b maps to a checkpoint at line 3 in Fig. 6a, and 
the application of f at line 12 maps to a call at line 11. However, other applica- 
tions, such as geqf and leqf, are guaranteed not to encounter any checkpoints. 
Therefore, they map to others, and not calls. The three ifs at lines 6, 9, and 
12 map to ifs. Note that we always lift the if conditions in Fig. 5b to a separate 
let as a result of ANF, and they are therefore not part of the if representation 
in stmt. We map all remaining operations to others. 


While the illustration above only shows how to map a CorePPL function body 
to stmts, the representation is general. For example, in the CorePPL compiler 
(Section 4.3), the decomposition is performed after translation to C, and not at 
the CorePPL stage. The reason is that there are no basic blocks in CorePPL. It 
is, therefore, more natural to perform this translation closer to RootPPL. 


We now turn to the full decomposition algorithm over lists of stmts, given 
in Algorithm 2. The target language representation is a small extension of stmt, 




Algorithm 2 A functional-style algorithm for function decomposition into basic 
blocks. We denote tuples with comma-separated expressions within parentheses 
and sequences with comma-separated items within square brackets. We denote 
type annotation with the : character, the cons operator with :: characters, and 
sequence concatenation with ++. The non-pure function newIndex returns a unique 
number from $\mathbb{N}$ at every call. 


     1  function DECOMPOSE srcs: [stmt] → (ℕ → [tstmt]) =
     2    let (block, blocks, _) = REC ([], ∅, return) srcs in
     3    blocks ∪ (newIndex (), block)
     4
     5  function INITNEXT next: next_? → next =
     6    match next with none → newIndex () | _ → next
     7
     8  function REC (block, blocks, next) srcs: acc → [stmt] → acc =
     9    match srcs with
    10    | [] → match next with
    11      | none → (block, blocks, next)
    12      | n | return → (block ++ [jump next], blocks, next)
    13    | src :: srcs → match src with
    14      | checkpoint | call → match srcs with
    15        | [] →
    16          let next = INITNEXT next in
    17          (block ++ [src next], blocks, next)
    18        | _ →
    19          let index = newIndex () in
    20          let block = block ++ [src index] in
    21          let (nextBlock, blocks, next) = REC ([], blocks, INITNEXT next) srcs in
    22          (block, blocks ∪ (index, nextBlock), next)
    23      | other → REC (block ++ [other], blocks, next) srcs
    24      | if thn els → match srcs with
    25        | [] →
    26          let (thn, thnBlocks, thnNext) = REC ([], blocks, next) thn in
    27          let (els, elsBlocks, elsNext) = REC ([], thnBlocks, thnNext) els in
    28          let thn = if next ≠ elsNext ∧ thnNext = none
    29                    then thn ++ [jump elsNext] else thn in
    30          (block ++ [if thn els], elsBlocks, elsNext)
    31        | _ →
    32          let (thn, thnBlocks, thnNext) = REC ([], blocks, none) thn in
    33          let (els, elsBlocks, elsNext) = REC ([], thnBlocks, thnNext) els in
    34          if elsNext = none then REC (block ++ [if thn els], elsBlocks, next) srcs
    35          else
    36            let thn = if thnNext = none then thn ++ [jump elsNext] else thn in
    37            let (nextBlock, blocks, next) =
    38              REC ([], elsBlocks, INITNEXT next) srcs in
    39            (block ++ [if thn els], blocks ∪ (elsNext, nextBlock), next)
adding transitions between $\mathbb{N}$-indexed basic blocks. It is given by 


tstmt ::= checkpoint next | call next | if [tstmt] [tstmt] | jump next | other. (4) 


In particular, we annotate checkpoints and calls with the type next, given by 
next ::= return | n, where $n \in \mathbb{N}$. For checkpoints, the next indicates which 
block to jump to after handling the checkpoint, and for calls, it indicates the 
block to return to (e.g., the value set for ra in Fig. 5c) at the end of the function 
invocation. We also include a jump in tstmt for directly jumping to another block 
(corresponding to BBLOCK_JUMP in Fig. 5c). The return case of next indicates 
that the return address gives the next block for the current function call. For 
example, BBLOCK_JUMP(sf->ra) is equivalent to jump return. 

Fig. 6b shows the result of applying Algorithm 2 on the [stmt] in Fig. 6a. 
Note that the block structure in Fig. 6b mirrors that of Fig. 5c. The entry point 
in Algorithm 2 is the function DECOMPOSE, which accepts a [stmt] as input, 
and produces a map from indices to [tstmt] as output (e.g., Fig. 6b). The core of 
Algorithm 2 is the function REC, which recursively constructs the basic blocks. 
It is called from DECOMPOSE, and makes use of the function INITNEXT. The 
accumulator is the triple (block, blocks, next) of type acc = [stmt] × (ℕ → 
[stmt]) × next, where block is the current block being constructed, blocks are 
all blocks constructed so far, and next indicates the action to take at tail position 
in the current block. The type next_? is defined as next_? ::= next | none. When 
reaching the end of a block, a value none for next means do nothing, a value 
return indicates that the next block is the return block for the current function 
invocation, and a natural number n means that the next block has index n. 

We now walk through the translation of Fig. 6a to Fig. 6b. We set the 
accumulator to ([], ∅, return) at line 2 in Algorithm 2 just before the initial call 
to REC, indicating that the current block is empty, that we have accumulated 
no complete blocks so far, and that we must use the return block address when 
reaching the end of the current block. In the first call to REC, the other at 
line 2 in Fig. 6a triggers the case at line 23 in Algorithm 2, which accumulates 
the other in the current block. Next, the checkpoint triggers the case at line 
14, followed by line 18, since the checkpoint is not at tail position. At line 
19, we create a new index for the following block. We then close the current 
block by tagging the checkpoint with the new index, resulting in block 1 in 
Fig. 6b. Next, we recursively create the block following the checkpoint at line 
21. Finally, we add the recursively created block with the new index to the map 
of complete blocks (now also populated by the recursive call) and return the 
updated accumulator triple at line 22. 

The complex part of Algorithm 2 involves handling of ifs. In particular, we 
must handle cases where there are block splits within the branches with care. 
In our example, the first if at line 5 in Fig. 6a triggers the case at line 31 since 
it is not in tail position. To determine whether or not there is at least one split 
within the branches, we set next to none for the call on line 32. If a block is split 
during this call, INITNEXT will be applied on next, and thnNext at line 32 will 




[Figure: compiler pipeline. Miking CorePPL -> ANF -> Static Analysis -> 
C Translation -> Function Decomposition -> Code Generation -> RootPPL Language] 

Fig. 7: The main components of the CorePPL-to-RootPPL compiler. Grey blocks 
are programs, and blue blocks are transformations or analyses. 


be a natural number, indicating where the branch jumped to (either through a 
jump, checkpoint, or call) at tail position. However, if there is no split in the 
branch, the resulting thnNext remains none. There is no split in the first branch 
of the if at line 5 in Fig. 6a, and none is passed to the recursive call at line 33 
as well. Again, there is no split in the second branch, triggering the then case at 
line 34, and we accumulate the if in the same way as an other. 

The ifs at lines 7 and 9 in Fig. 6a do contain a split due to the call at line 
11, resulting in blocks 2, 3, and 4, shown in Fig. 6b. The elsNext is a natural 
number for these ifs, and the else case at line 35 is triggered. Here, we must 
take particular care if there is only a split in the second branch of the if and not 
the first. In that case, thnNext is none, and unlike the second branch, we do not 
add a block jump to the end of this branch in the call at line 32. Therefore, we 
must instead add it at line 36. We add the jump at line 11 in block 2 in Fig. 6b 
in this way. Note that we do not require an equivalent step to the above for the 
second branch if the split is only in the first branch, since we pass the next from 
the first branch to the recursive call for the second branch. After handling the if 
itself, we recursively create the new block following the if at lines 37-38 (note 
that we pass the next given as argument to REC here, and use INITNEXT on it 
to indicate a split has occurred), and give it the index elsNext at line 39. 

The case where if is at tail position, at line 25, is handled similarly to the 
case at line 31. The difference is that we do not pass none to the first branch 
since there is nothing following the if which we can jump to. Instead, we directly 
pass the current next to the first call at line 26. 

In the blocks resulting from Algorithm 2, calls and checkpoints only occur 
at tail position by construction. As discussed in Section 4.1, this is precisely the 
required property when compiling to PCFGs. 
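
To make the walkthrough easier to replay, the following is one possible C++ transcription of Algorithm 2. It is a sketch under assumptions, not the compiler's code: a single Stmt struct represents both stmt and tstmt, the integers kNone and kReturn encode none and return, and the list srcs is traversed with an index pos instead of the cons pattern. Indices come from a global counter, so block numbers differ from those in Fig. 6b while the block structure is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

enum Kind { Checkpoint, Call, If, Other, Jump };
const int kNone = -2;    // encodes "none"
const int kReturn = -1;  // encodes "return"

struct Stmt {
    Kind kind;
    std::vector<Stmt> thn, els;  // branches, used by If only
    int next;                    // target of Checkpoint, Call, and Jump
};
using Block = std::vector<Stmt>;
using Blocks = std::map<int, Block>;
struct Acc { Block block; Blocks blocks; int next; };

Stmt other() { return Stmt{Other, {}, {}, kNone}; }
Stmt checkpoint() { return Stmt{Checkpoint, {}, {}, kNone}; }
Stmt call() { return Stmt{Call, {}, {}, kNone}; }
Stmt jump(int next) { return Stmt{Jump, {}, {}, next}; }
Stmt ifs(Block t, Block e) { return Stmt{If, std::move(t), std::move(e), kNone}; }

int indexCounter = 0;
int newIndex() { return ++indexCounter; }
int initNext(int next) { return next == kNone ? newIndex() : next; }

Acc rec(Block block, Blocks blocks, int next, const Block& srcs, std::size_t pos) {
    if (pos == srcs.size()) {                                  // lines 10-12
        if (next != kNone) block.push_back(jump(next));
        return {block, blocks, next};
    }
    const Stmt& src = srcs[pos];
    const bool tail = (pos + 1 == srcs.size());
    if (src.kind == Checkpoint || src.kind == Call) {          // line 14
        if (tail) {                                            // lines 15-17
            next = initNext(next);
            block.push_back(Stmt{src.kind, {}, {}, next});
            return {block, blocks, next};
        }
        const int index = newIndex();                          // lines 18-22
        block.push_back(Stmt{src.kind, {}, {}, index});
        Acc r = rec({}, blocks, initNext(next), srcs, pos + 1);
        r.blocks[index] = r.block;
        return {block, r.blocks, r.next};
    }
    if (src.kind == Other) {                                   // line 23
        block.push_back(src);
        return rec(block, blocks, next, srcs, pos + 1);
    }
    // If: pass the current next at tail position (line 26), else none (line 32).
    Acc t = rec({}, blocks, tail ? next : kNone, src.thn, 0);
    Acc e = rec({}, t.blocks, t.next, src.els, 0);             // lines 27 and 33
    if (tail) {                                                // lines 28-30
        if (next != e.next && t.next == kNone) t.block.push_back(jump(e.next));
        block.push_back(ifs(t.block, e.block));
        return {block, e.blocks, e.next};
    }
    if (e.next == kNone) {                                     // line 34: no split
        block.push_back(ifs(t.block, e.block));
        return rec(block, e.blocks, next, srcs, pos + 1);
    }
    if (t.next == kNone) t.block.push_back(jump(e.next));      // line 36
    Acc r = rec({}, e.blocks, initNext(next), srcs, pos + 1);  // lines 37-38
    block.push_back(ifs(t.block, e.block));
    r.blocks[e.next] = r.block;                                // line 39
    return {block, r.blocks, r.next};
}

Blocks decompose(const Block& srcs) {                          // lines 1-3
    Acc r = rec({}, {}, kReturn, srcs, 0);
    r.blocks[newIndex()] = r.block;  // the entry block gets the last index
    return r.blocks;
}
```

Calling decompose on the list from Fig. 6a yields four blocks whose shapes match Fig. 6b up to renaming of indices.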


4.3 CorePPL-to-RootPPL Compiler 


Fig. 7 gives an overview of the CorePPL-to-RootPPL compiler components. Be- 
sides the techniques described previously, an integral part of the compiler is the C 
translation step, which translates many of the CorePPL language features to C, 
including data type definitions and pattern matching. More precisely, CorePPL 
records and variants are translated to C structs and tagged unions, respectively, 
while pattern matching is compiled to C if statements. 
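
As an illustration of this kind of translation, consider a hypothetical variant type with two constructors, Leaf carrying one float and Pair carrying two. The C-style sketch below shows one plausible encoding as a tagged union, with a match on the constructor turned into an if statement. The type and function names are invented for this example, and the code the compiler actually emits may differ.

```cpp
#include <cassert>

// Tag identifying the active constructor of the variant.
enum ShapeTag { LEAF, PAIR };

// Records become plain structs.
struct Leaf { double value; };
struct Pair { double left; double right; };

// The variant becomes a tag plus a union of the constructor payloads.
struct Shape {
    ShapeTag tag;
    union {
        Leaf leaf;
        Pair pair;
    } data;
};

// A pattern match over the constructors becomes an if statement on the tag.
double total(const Shape& s) {
    if (s.tag == LEAF) {
        return s.data.leaf.value;
    } else {
        return s.data.pair.left + s.data.pair.right;
    }
}
```

Both payloads have a fixed size, so a value of this type can live entirely in a stack frame.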

A simple static analysis phase discovering functions that are guaranteed not 
to encounter any resamples is also part of the compiler. It iterates through all 
functions and marks a function as containing a resample if it either directly 
contains a resample or calls another function containing a resample. We do 
not need to decompose resample-free functions, and invocations can be handled 
directly by the C++ or CUDA compiler (and we do not need to set up an explicit 
stack frame). An example of such a function invocation is the geqf s1 1. at line 
5 in Fig. 5b. We disallow passing functions as arguments to other functions as 
it complicates the analysis. A solution to allow passing functions as arguments 
is to use static analysis techniques such as 0-CFA [35] instead. 
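
The marking analysis described above amounts to a fixed-point iteration over the call graph. A minimal sketch follows, assuming a hypothetical program representation (FunctionInfo, holding a flag for a direct resample and a list of callee names); the compiler's actual data structures are not shown in the text.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one function: whether its body contains a
// resample directly, and the names of the functions it calls.
struct FunctionInfo {
    bool hasDirectResample;
    std::vector<std::string> callees;
};

// Returns the functions that may encounter a resample, directly or
// indirectly. All other functions need no decomposition.
std::set<std::string> functionsWithResample(
        const std::map<std::string, FunctionInfo>& program) {
    std::set<std::string> marked;
    for (const auto& entry : program) {
        if (entry.second.hasDirectResample) marked.insert(entry.first);
    }
    bool changed = true;
    while (changed) {  // iterate until no new function gets marked
        changed = false;
        for (const auto& entry : program) {
            if (marked.count(entry.first)) continue;
            for (const std::string& callee : entry.second.callees) {
                if (marked.count(callee)) {
                    marked.insert(entry.first);
                    changed = true;
                    break;
                }
            }
        }
    }
    return marked;
}
```

The result is conservative: a function is marked whenever some call path could reach a resample, even if that path is never taken at run time.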

The code generation stage in Fig. 7 adds RootPPL boilerplate code and emits 
a complete RootPPL program that is provided as input to a C++ or CUDA 
compiler together with the RootPPL inference engine (see Fig. 1). The CorePPL 
compiler implementation is hosted at GitHub [4] and consists of approximately 
3000 lines of code (a contribution of this paper). Note that the ANF, static 
analysis, and C translation steps are quite standard, with no new contributions. 

An important detail concerning memory allocation in the compiler is the 
translation between relative and absolute addresses. Fig. 5c illustrates this trans- 
lation. On line 3 in block 4, we convert the retValLoc relative pointer to an 
absolute pointer prior to dereferencing, and at lines 18-20 in block 2, the ad- 
dress of s4 is translated to a relative address with respect to the start of the 
stack before being assigned to retValLoc. This translation is needed because, 
at checkpoints in RootPPL, resampling copies and moves SMC executions in 
memory. Therefore, we cannot use absolute addresses to refer to data on the 
PSTATE stack and must instead use addresses relative to the start of the stack. 
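
The need for relative addresses can be seen in a few lines. In this sketch, the helper names toRelative and toAbsolute and the small ProgState type are assumptions for illustration; they mirror the two conversions in Fig. 5c.

```cpp
#include <cassert>
#include <cstddef>

struct ProgState {
    unsigned char stack[64] = {};
    std::size_t stackPtr = 0;
};

// Absolute pointer into s.stack -> offset from the start of the stack.
std::size_t toRelative(const ProgState& s, const unsigned char* p) {
    assert(p >= s.stack && p < s.stack + sizeof(s.stack));
    return static_cast<std::size_t>(p - s.stack);
}

// Offset from the start of the stack -> absolute pointer into s.stack.
unsigned char* toAbsolute(ProgState& s, std::size_t offset) {
    assert(offset < sizeof(s.stack));
    return s.stack + offset;
}
```

An offset obtained from one state remains meaningful in a copy of that state (for example, after resampling duplicates a particle), whereas a stored absolute pointer would still refer to the original's memory.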


4.4 Compiler Strengths and Limitations 


The main strength of the CorePPL compiler, compared to using other PPL com- 
pilers and tools, is the execution time of the compiled programs. In particular, 
the compilation from a universal PPL to CUDA is the first of its kind and allows 
for utilizing GPUs for massively parallel SMC inference. 

The compiler does, however, have some limitations. Most importantly, the 
lack of standard garbage collectors in C++ and CUDA leads to restrictions for 
automatic data allocation. Currently, we support only stack-based allocation, 
which means that CorePPL programs that allocate and return dynamically sized 
data structures (e.g., trees or linked lists) from functions are not supported. Con- 
sequently, the current compiler cannot handle probabilistic programs encoding 
distributions over such data structures (e.g., phylogenetic trees)—the distribu- 
tion must be over fixed-size data types. However, as the evaluation in Section 5 
suggests, practically significant universal probabilistic programs over fixed-sized 
data types are plentiful. In general, the compiler supports universal CorePPL 
programs including both stochastic branching and an unbounded number of (stack- 
allocated) random variables. Automatic heap-based data allocation is a general 
challenge when compiling to GPUs and not specific to our approach. Exploring 
the use of garbage collectors or other means for automatic memory management 
on GPUs is an interesting direction for future research. 

The compiler also lacks support for some features that we foresee no 
substantial technical challenges in implementing in the near future. In 
particular, the compiler does not support first-class distributions: we restrict 
distributions to occur immediately at assumes (e.g., the Bernoulli distribution 
in assume (Bernoulli p) in Fig. 2a). Another possible feature is limited 
support for nested and higher-order functions. 


5 Evaluation 


This section evaluates RootPPL and the CorePPL-to-RootPPL compiler. The 
source code for all experiments is publicly available [26]. We compare RootPPL 
and CorePPL to state-of-the-art SMC PPL implementations on two models: a 
constant rate birth-death (CRBD) model from evolutionary biology (Sections 5.1 
and 5.3) and a vector-borne disease model from epidemiology (Section 5.2). 
Previous work shows that SMC handles these models particularly well [36,28], 
and they are therefore good candidates for this evaluation. Comparison with 
other types of inference algorithms is a challenging problem and beyond the 
scope of this paper. For example, comparing SMC with variational inference 
(VI) is challenging as VI is approximate and SMC is asymptotically exact. 

In addition to CorePPL (compiled to RootPPL) and RootPPL (hand-tuned), 
we implement the models above in a set of state-of-the-art PPLs with SMC 
inference: Birch [32], WebPPL [20], and Pyro [10]. For each PPL, we implement 
the two models as efficiently as possible, given the available language features. We 
compile RootPPL with GCC 7.5.0 for single-core and multicore and with CUDA 
11.4 for GPU. We compile Birch 1.634 with GCC 7.5.0. We use WebPPL 0.9.15 
with Node.js 14.17.6. We use Pyro 1.7.0 with PyTorch 1.9.0 and CUDA 10.2. 
Additionally, we use Numba 0.54.0—a just-in-time (JIT) compiler for Python— 
to improve the Pyro performance for the Section 5.1 experiment. 

To aid the comparison between languages both in the text and in the figures, 
we use the (S), (M), and (G) symbols suffixed to PPL names to indicate if 
they run on single-core, multicore, or GPU, respectively. Despite the CUDA 
dependency for Pyro, we did not observe any GPU usage during Pyro SMC 
runs. In Pyro, SMC is a minor inference algorithm, with variational inference 
instead being the main focus. This may explain this lack of GPU support for 
SMC. Consequently, we classify SMC in Pyro as (M) and not (G). 

We ran all experiments on a machine with a 12-core (24 threads) Intel Xeon 
Gold 6136 CPU, 64 GB of memory, and an NVIDIA TITAN RTX GPU with 24 
GB of memory and 4608 CUDA cores. 


5.1 Experiment: Constant-Rate Birth Death 


In this experiment, we consider the non-trivial CRBD model described in 
Ronquist et al. [36]. This model encodes the posterior distributions of the 
rates with which new evolutionary lineages arise (birth rate) and die out (death 
rate), conditioned on the input of a fixed evolutionary tree (phylogeny). We use 
the dated Alcedinidae phylogeny (Kingfisher birds) referenced in Ronquist et al. 
[36], and introduced in Jetz et al. [23]. A notable feature of this model is 
that it contains recursive tree constructions, which are only expressible in 
universal PPLs. The CorePPL implementation of this model consists of 118 lines 
of code. 

Fig. 8: Execution times for the CRBD experiment, for different numbers of 
particles N. The vertical line at the top of each bar indicates one standard 
deviation. PPLs marked (S) run on a single core, (M) on multicore, and (G) on 
the GPU. 

We measure execution time. To ensure fairness, we disabled variance-reducing 
techniques such as delayed sampling [28] and ESS-triggered resampling in all 
PPLs where available. Consequently, all implementations use precisely the same 
SMC inference algorithm. We checked this and the implementations’ correctness 
by considering the output normalizing constant estimates in all runs. The 
variance and mean of these estimates were comparable for all PPLs. 

The results of the experiment are shown in Fig. 8 for three different numbers 
of SMC particles: 10000, 100000, and 1000000. We ran the PPL implementa- 
tions for 100 iterations (a number determined by available time and hardware) 
for each number of SMC particles. The exception to this is WebPPL (S) and 
Pyro (M), which we ran only for 10000 particles due to excessive execution 
times. For 10000 particles, WebPPL (S) ran for 55 seconds (standard deviation 
0.63 seconds), and Pyro (M) for 250 seconds (standard deviation 28 seconds). 
We omit WebPPL (S) and Pyro (M) from Fig. 8. Pyro relies heavily upon vec- 
torization through PyTorch, and the expensive operations in the CRBD model 
are recursive and stochastic tree constructions, which are difficult to vectorize. 
This explains the particularly abnormal execution times for Pyro (M). 

RootPPL is the best alternative in all categories. We conjecture that the 
difference compared to CorePPL is due to hand-tuned details in the RootPPL 
model. The RootPPL model uses efficient array encodings of the observed tree, 
precomputes the recursion order over this tree, and encodes it as an iterative 
procedure. CorePPL instead compiles the tree as a tagged union type with 
pointers to subtrees in each node and traverses it via recursion. Automatically 
discovering this transformation from trees to arrays and recursion to iteration 
is non-trivial and not considered here, but it is a candidate for future work. 

Fig. 9: Execution times for the Vector-Borne Disease experiment, for different 
numbers of particles N. The vertical line at the top of each bar indicates one 
standard deviation. PPLs marked (S) run on a single core, (M) on multicore, and 
(G) on the GPU. 

To improve the performance of Pyro, we also applied Numba to parallelize 
the recursive tree construction in the model manually. The parallelization we ap- 
ply is more fine-grained than the natural SMC particle parallelism and resulted 
in an order-of-magnitude performance boost over Pyro (M). Unlike CorePPL, 
RootPPL, and Birch, the execution times for Pyro/Numba (M) seem to grow 
sub-linearly when going from 100000 to 1000000 particles, as this only increases 
mean execution time from 6.72 seconds to 13.76. We conjecture that this is re- 
lated to the different type of parallelism introduced with Numba, in combina- 
tion with its JIT compilation. Therefore, looking at adding such parallelism to 
RootPPL and CorePPL is an interesting direction for future work. 
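
To make the sub-linear growth concrete, the short Python calculation below uses
only the two mean execution times reported above for Pyro/Numba (M). The
empirical scaling exponent is an illustrative summary computed from just these
two data points, so it should be read as a rough indication rather than a
fitted model.

```python
import math

# Mean execution times reported for Pyro/Numba (M), in seconds.
t_small, t_large = 6.72, 13.76
n_small, n_large = 100_000, 1_000_000

# Factor by which the execution time grows for a 10x increase in particles.
time_ratio = t_large / t_small

# Exponent e such that time is roughly proportional to N**e between the points.
exponent = math.log(time_ratio) / math.log(n_large / n_small)

print(f"time ratio: {time_ratio:.2f}, scaling exponent: {exponent:.2f}")
```

Linear scaling would correspond to a time ratio of 10 and an exponent of 1; the
reported times give a ratio of roughly 2 instead.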


5.2 Experiment: Vector-Borne Disease 


Next, we consider the vector-borne disease model from Funk et al. [16], which 
is also studied further in Murray et al. [28]. This epidemiological model encodes 
a dengue outbreak in Micronesia and includes the spread of disease between 
mosquito and human populations. The inference is over the number of suscep- 
tible, exposed, infectious, and recovered (SEIR) individuals in the populations 
at discrete time steps (days), and the observations are daily numbers of re- 
ported new cases at health centers (the data is available in Funk et al. [16]). The 
CorePPL implementation of this model consists of 140 lines of code. 

The experiment setup is identical to Section 5.1 but with fewer SMC particles 
due to more demanding computations in the model. Fig. 9 shows the results. We 
omit WebPPL (S) entirely due to high execution times. However, we include Pyro 
(M) because the simple non-stochastic control-flow in this model allows much 
better vectorization than the CRBD model. The Numba optimization in Section 5.1 
relied on the recursive structure of the model. We exclude Pyro/Numba (M) here, 
as such an optimization is not possible in this model. 

Fig. 10: Execution times for the CRBD experiment with variance-reducing 
techniques for different numbers of particles N. The vertical line at the top of 
each bar indicates one standard deviation. PPLs marked (S) run on a single core, 
(M) on multicore, and (G) on the GPU. Note the 6x speedup of RootPPL (M) over 
Birch (M) for N = 100000. 

This time, CorePPL is the best option, by a small margin, over RootPPL. 
We conjecture that this is due to how RootPPL preallocates memory, which is 
instead dynamically allocated in CorePPL. This results in copying slightly more 
memory during resampling for this model in RootPPL. 

The difference between GPU and CPU for CorePPL and RootPPL is not as 
significant as in Fig. 8. We conjecture that this is due to the lower numbers of 
SMC particles used and RootPPL using different implementations for binomial 
distribution sampling on the CPU and GPU. The GPU uses a custom and less 
efficient version because the C++ standard library binomial sampling 
implementation is not available in CUDA. Because binomial sampling is the most 
expensive operation in this model, a faster GPU sampler could improve GPU 
performance further. 


5.3 Experiment: CRBD with Variance-Reducing Techniques 


In this experiment, we again consider the CRBD model from Section 5.1, but 
with delayed sampling and ESS-triggered resampling allowed. Also, we now con- 
sider a different, more challenging phylogeny of Tyrant flycatchers [36,23]. 

Fig. 10 shows the results. Other than the changes above, the setup is identical 
to Section 5.1. We added static delayed sampling manually to all models to 
ensure fairness. Note, however, that automatic and dynamic delayed sampling, 
as introduced in Murray et al. [28], is also natively supported in Birch (but 
introduces some unfair overhead). CorePPL is omitted here, as adding efficient 
delayed sampling to the model is rendered more difficult by the current lack of 
support for mutable data structures. Based on the experiment in Section 5.1, 
WebPPL (S) and Pyro (M) are also not considered here. 

The results offer no surprises compared to Fig. 8, and RootPPL is again the best 
alternative. Note the increased execution times here compared to Fig. 8 due to 
the more challenging phylogeny and delayed sampling overhead (which is greatly 
compensated by increased inference accuracy). 


6 Related Work 


There are quite a few PPL implementations making use of SMC inference. Most 
closely related to the contributions in this paper is Birch [32]. Similarly to 
RootPPL, Birch implements SMC inference, and the target language for com- 
pilation is C++. However, while performance is one of the main goals with 
Birch, some overhead is inevitably introduced by supporting various quality-of- 
life C++ features—including automatic heap allocation [30] and object-oriented 
features. RootPPL does not support such features in favor of performance. Simi- 
larly to RootPPL, Birch supports CPU parallelism through the use of OpenMP. 
Compilation to GPUs is, however, currently not supported in Birch. 

The PCFG concept can also be related to Birch. In Birch, users write models 
for SMC inference as a method simulate which the inference algorithm calls 
iteratively. Resampling only occurs between calls to this method. Furthermore, 
data is passed between calls to simulate through particle variables stored in an 
object defined as part of the model (similar to the PCFG state). We can view 
PCFG basic blocks as a natural generalization of the Birch simulate method, 
conceptually allowing for many simulate methods with arbitrary control-flow 
in between them. In particular, SMC particles can take different paths through 
the PCFG. As with PCFG blocks, the explicit simulate function used in Birch 
can make it more challenging for programmers to express models. 
This is not a problem when using our approach of compiling into PCFGs, as we 
then do the block decomposition automatically. 

Besides Birch, parallelism for SMC inference in PPLs is surprisingly absent 
in previous work. The predecessor of Birch, LibBi [29], is an exception to this 
and implements highly performant SMC inference through SIMD instructions, 
OpenMP, and CUDA. However, in contrast with RootPPL and CorePPL, the 
LibBi modeling language is not universal. In other words, LibBi cannot express 
many probabilistic models. 

Pyro [10] is a PPL mainly focused on stochastic variational inference, sup- 
porting MCMC and SMC in addition. SMC in Pyro is similar to Birch in that 
models are constructed using an explicit step function (equivalent to simulate 
in Birch). In general, Pyro supports parallelism through vectorization using 
PyTorch [5] tensors, which is powerful but also restrictive. We saw this in 
Section 5.1, where we could not use Pyro tensors to parallelize the tree recursion. 

Other universal PPLs implementing SMC inference include WebPPL [20] 
and Anglican [40]. These languages are embedded in JavaScript and Clojure, 
respectively, and implement several inference algorithms (including SMC) through 
CPS transformations. The focus is on ease of modeling through functional-style 
constructs supported by complex runtimes (V8 for JavaScript and the JVM 
for Clojure) and supporting many different inference algorithms. Parallelism for 
SMC is not directly supported, which is different from CorePPL and RootPPL, 
where the focus is parallelism and performance. 

Stan [12] and AugurV2 [22] support GPU parallelization of MCMC. Their 
modeling languages are, however, more restricted than CorePPL. Stan supports 
explicit parallelization of specific functions, and the AugurV2 compiler can com- 
pile to MCMC algorithms running partially in parallel on CUDA. This is quite 
different from the natural SMC parallelism in CorePPL and RootPPL. 

There are also many other probabilistic programming tools, libraries, and 
languages available, for instance, Gen [13], Turing [17], Hakaru [34], and Ed- 
ward [38]. Generally, these either focus on assisting users in manually construct- 
ing inference algorithms tailored for their specific models or on providing efficient 
inference for a restricted set of models. 


7 Conclusion 


This paper introduced the concept of PCFGs and a general method for compil- 
ing universal PPLs to PCFGs. We illustrated these contributions further through 
the RootPPL implementation and the CorePPL compiler. This is the first work 
compiling a universal PPL to GPU with SMC inference. Furthermore, the evalua- 
tion showed that CorePPL and RootPPL can deal with real-world SMC inference 
problems and outperform the current state-of-the-art with up to 6x speedups 
for challenging models (and even more when compared across CPU and GPU). 
This gives strong empirical support for the usefulness of the contributions. 

Possible improvements upon this work include the exploration of more com- 
plex CUDA and C++ runtimes for RootPPL, e.g., runtimes with automatic 
memory management through garbage collection. Additionally, high-performance 
implementations similar to RootPPL for other inference methods (e.g., MCMC) 
are highly relevant for many probabilistic models—for instance, various models 
from phylogenetics [36]. We leave these topics for future work. 


Abstract. Quantitative separation logic (QSL) is an extension of separation 
logic (SL) for the verification of probabilistic pointer programs. In QSL, 
formulae evaluate to real numbers instead of truth values, e.g., the 
probability of memory-safe termination in a given symbolic heap. As with SL, 
one of the key problems when reasoning with QSL is entailment: does a formula 
f entail another formula g? 

We give a generic reduction from entailment checking in QSL to entailment 
checking in SL. This allows us to leverage the large body of SL research 
for the automated verification of probabilistic pointer programs. We an- 
alyze the complexity of our approach and demonstrate its applicability. 
In particular, we obtain the first decidability results for the verification 
of such programs by applying our reduction to a quantitative extension 
of the well-known symbolic-heap fragment of separation logic. 


1 Introduction 


Separation logic [29] (SL) is a popular formalism for Hoare-style verification of 
imperative, heap-manipulating and, possibly, concurrent programs. Its assertion 
language extends first-order logic with two connectives, the separating 
conjunction $\star$ and the magic wand $-\!\!\star$, that enable concise 
specifications of how program memory, or other resources, can be split up and 
combined. SL builds upon 
these connectives to champion local reasoning about the resources employed 
by programs. Consequently, program parts can be verified by considering only 
those resources they actually access—a crucial property for building scalable 
tools including automated verifiers [46,12,16,44,31], static analyzers [10,24,14], 
and interactive theorem provers [32]. At the foundation of almost any automated 
approach based on SL lies the entailment problem $\varphi \models \psi$: are all 
models of SL formula $\varphi$ also models of SL formula $\psi$? For example, 
Hoare-style verifiers need to solve entailments whenever they invoke the rule of 
consequence, and static analyzers ultimately solve entailments to perform 
abstraction. While entailment is undecidable in general [1], the wide adoption 
of SL and the central role of the entailment 
problem have triggered a massive research effort to identify SL fragments with 
a decidable entailment problem [11,17,21,22,27,28,35,40,47,18,20], and to build 
practical entailment solvers [46,12,16,50]. 

Probabilistic programs, that is, programs with the ability to sample from prob- 
ability distributions, are an increasingly popular formalism for, amongst others, 
designing efficient randomized algorithms [42] and describing uncertainty in 
systems [23,15]. While formal reasoning techniques for probabilistic programs 
have existed since the 1980s (cf. [37,38,49]), they are rarely automated and 
typically target only 
simplistic programming languages. For example, verification techniques that sup- 
port reasoning about both randomization and data structures are, with notable 
exceptions [51,9], rare—a surprising situation given that randomized algorithms 
typically rely on dynamic data structures. 

Quantitative separation logic (QSL) is a weakest-precondition-style verifica- 
tion technique that targets randomized algorithms manipulating complex data 
structures; it marries SL and weakest preexpectations [43]—a well-established 
calculus for reasoning about probabilistic programs. In contrast to classical 
SL, QSL’s assertion language does not consist of predicates, which evaluate to 
Boolean values, but expectations (or: random variables), which evaluate to real 
numbers. QSL has been successfully applied to the verification of randomized 
algorithms, and QSL expectations have been formalized in Isabelle/HOL [26]. 
However, reasoning is far from automated—mainly due to the lack of decision 
procedures or solvers for entailments between expectations in QSL. 

This paper presents, to the best of our knowledge, the first technique for 
automatically deciding QSL entailments. More precisely, we reduce QSL quanti- 
tative entailments to classical entailments between SL formulas. Hence, we can 
leverage two decades of separation logic research to advance QSL entailment 
checking, and thus also automated reasoning about probabilistic programs. 


Contributions. We make the following technical contributions: 


— We present a generic construction that reduces the entailment problem for 
quantitative separation logic to solving multiple entailments in fragments 
of SL; if we reduce to an SL fragment where entailment is decidable, our 
construction yields a QSL fragment with a decidable entailment problem. 

— We provide simple criteria for whether one can leverage a decision procedure 
or a practical entailment solver for SL to build an entailment solver for QSL. 

— We analyze the complexity of our approach parameterized in the complexity 
of solving entailments in a given SL fragment; whenever we identify a decid- 
able QSL fragment, it is thus accompanied by upper complexity bounds. 

— We use our construction to derive the QSL fragment of quantitative symbolic 
heaps for which entailment is decidable via a reduction to the 
Bernays-Schönfinkel-Ramsey fragment of SL [20]. 


Outline. Section 2 introduces (quantitative) separation logic. Section 3 motivates 
our approach by providing the foundations for probabilistic pointer program 
verification with QSL together with several examples. We present the key ideas 
and our main contribution of reducing QSL entailment checking to SL entailment 
checking in Section 4. We analyze the complexity of our approach in Section 5. 
In Section 6, we apply our approach to obtain the first decidability results for 
probabilistic pointer verification. Finally, Section 7 discusses related work and 
Section 8 concludes. 

Detailed proofs are found in an extended version of this paper [7]. 


Table 1. Metavariables used throughout this paper. 


Entities                    Metavariables                            Domain 
Natural numbers             $n, i, j, k$                             $\mathbb{N}$ 
Rational probabilities      $p, q, \alpha, \beta, \gamma, \delta$    $\mathbb{P}$ 
Programs                    $C$                                      hpGCL 
Stacks                      $s$                                      Stacks 
Heaps                       $h$                                      $\mathsf{Heaps}_k$ 
Variables                   $x, y, z$                                Vars 
Values                      $v, w$                                   Vals 
Locations                   $\ell$                                   Locs 
Predicates                  $\Phi$                                   $\mathcal{P}(\mathsf{States})$ 
One-bounded expectations    $X$                                      $\mathbb{E}_{\leq 1}$ 
SL formulae                 $\varphi, \psi, \vartheta$               $\mathsf{SL}[\cdot]$ 
Pure formulae               $\pi$ 
QSL formulae                $f, g, u, I$                             $\mathsf{QSL}[\cdot]$ 


2 (Quantitative) Separation Logic 


2.1 Program States 


Let Vals be a countably infinite set of values, and let Vars be a countably infinite 
set of variables with domain Vals. The set of stacks is given by 


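$$ \mathsf{Stacks} = \{\, s \mid s\colon \mathsf{Vars} \to \mathsf{Vals} \,\} . $$


Let $\mathsf{Locs} \subseteq \mathsf{Vals}$ be an infinite set of locations. We denote locations by 
$\ell$ and variations thereof. We fix a natural number $k \geq 1$ and a heap model 
where finite sets of locations are mapped to fixed-size records over Vals of 
size $k$. Put more formally, the set of heaps is given by 


$$ \mathsf{Heaps}_k = \{\, h \mid h\colon L \to \mathsf{Vals}^k,\ L \subseteq \mathsf{Locs},\ |L| < \infty \,\} . $$

The set of program states is then given by 


$$ \mathsf{States} = \{\, (s,h) \mid s \in \mathsf{Stacks},\ h \in \mathsf{Heaps}_k \,\} . $$


Table 2. Semantics of $\mathsf{SL}[\mathfrak{A}]$ formulae. 


$\varphi$                         $(s,h) \models \varphi$ iff 
$\psi$                            $(s,h) \in \llbracket \psi \rrbracket$ 
$\neg\varphi$                     $(s,h) \not\models \varphi$ 
$\varphi \wedge \vartheta$        $(s,h) \models \varphi$ and $(s,h) \models \vartheta$ 
$\varphi \vee \vartheta$          $(s,h) \models \varphi$ or $(s,h) \models \vartheta$ 
$\exists x\colon \varphi$         $(s[x := v], h) \models \varphi$ for some $v \in \mathsf{Vals}$ 
$\forall x\colon \varphi$         $(s[x := v], h) \models \varphi$ for all $v \in \mathsf{Vals}$ 
$\varphi \star \vartheta$         $(s,h_1) \models \varphi$ and $(s,h_2) \models \vartheta$ for some $h_1 \star h_2 = h$ 
$\varphi -\!\!\star \vartheta$    $(s, h \star h') \models \vartheta$ for all $h' \perp h$ with $(s,h') \models \varphi$ 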


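Given a program state $(s,h)$ and an expression $t$ over Vars, we denote by $t(s)$ 
the evaluation of expression $t$ in $s$, i.e., the value that is obtained by 
evaluating $t$ after replacing any occurrence of any variable $x \in \mathsf{Vars}$ in 
$t$ by the value $s(x)$. We write $s[x := v]$ to indicate that we set variable $x$ 
to value $v \in \mathsf{Vals}$ in $s$, i.e.,$^4$ 


$$ s[x := v] = \lambda y.\ \begin{cases} v, & \text{if } y = x \\ s(y), & \text{if } y \neq x . \end{cases} $$

For heap $h$, $h[\ell := (v_1, \ldots, v_k)]$ is defined analogously. For a given 
heap $h\colon L \to \mathsf{Vals}^k$, we denote by $\mathrm{dom}(h)$ its domain $L$. Two heaps 
$h_1, h_2$ are disjoint, denoted $h_1 \perp h_2$, if their domains do not overlap, 
i.e., $\mathrm{dom}(h_1) \cap \mathrm{dom}(h_2) = \emptyset$. The disjoint union of two disjoint 
heaps $h_1\colon L_1 \to \mathsf{Vals}^k$ and $h_2\colon L_2 \to \mathsf{Vals}^k$ is 


$$ h_1 \star h_2\colon \mathrm{dom}(h_1) \mathbin{\dot\cup} \mathrm{dom}(h_2) \to \mathsf{Vals}^k, \qquad (h_1 \star h_2)(\ell) = \begin{cases} h_1(\ell), & \text{if } \ell \in \mathrm{dom}(h_1) \\ h_2(\ell), & \text{if } \ell \in \mathrm{dom}(h_2) . \end{cases} $$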
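
The definitions of this subsection can be mirrored in a small Python sketch. It
rests on simplifying assumptions that are not part of the formal model: stacks
are finite dictionaries rather than total functions on a countably infinite set
of variables, values are integers, and the helper names update, disjoint, and
union are chosen for illustration only.

```python
from typing import Dict, Tuple

Stack = Dict[str, int]              # finite stand-in for s: Vars -> Vals
Heap = Dict[int, Tuple[int, ...]]   # location -> record of size k


def update(s: Stack, x: str, v: int) -> Stack:
    """s[x := v]: a new stack that agrees with s everywhere except at x."""
    t = dict(s)
    t[x] = v
    return t


def disjoint(h1: Heap, h2: Heap) -> bool:
    """h1 and h2 are disjoint iff their domains do not overlap."""
    return h1.keys().isdisjoint(h2.keys())


def union(h1: Heap, h2: Heap) -> Heap:
    """Disjoint union of h1 and h2; only defined for disjoint heaps."""
    if not disjoint(h1, h2):
        raise ValueError("disjoint union is undefined for overlapping heaps")
    return {**h1, **h2}


s = {"x": 1}
h1 = {1: (2,)}   # records of size k = 1
h2 = {2: (0,)}
h = union(h1, h2)
```

Note that update returns a fresh stack and leaves s unchanged, matching the
functional reading of $s[x := v]$.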


2.2 Separation Logic 


A predicate $\Phi \in \mathcal{P}(\mathsf{States})$ is a set of states. A predicate $\Phi$ is called 
pure if it does not depend on the heap, i.e., for every stack $s$ and heaps 
$h, h'$, we have $(s,h) \in \Phi$ iff $(s,h') \in \Phi$. 


4 We use $\lambda$-expressions to denote functions: Function $\lambda X.\, f$ applied to an 
argument $v$ evaluates to $f$ in which every occurrence of $X$ is replaced by $v$. 

We consider a separation logic $\mathsf{SL}[\mathfrak{A}]$ with standard semantics [48]. A 
distinguishing aspect is that $\mathsf{SL}[\mathfrak{A}]$ is parametrized by a set $\mathfrak{A}$ of 
predicate symbols $\psi$ with given semantics $\llbracket \psi \rrbracket \in \mathcal{P}(\mathsf{States})$. We often 
identify predicate symbols $\psi$ with their predicates $\llbracket \psi \rrbracket$. Elements of 
$\mathfrak{A}$ build the atoms of $\mathsf{SL}[\mathfrak{A}]$. Our reduction from quantitative entailments 
to qualitative entailments does not depend on the choice of these predicate 
symbols. We therefore take a generic approach that allows for user-defined 
atoms, e.g., list or tree predicates. 


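Definition 1. Let $\mathfrak{A}$ be a countable set of predicate symbols. Formulae in 
separation logic $\mathsf{SL}[\mathfrak{A}]$ with atoms in $\mathfrak{A}$ adhere to the grammar 


$$ \varphi \;\to\; \psi \mid \neg\varphi \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid \exists x\colon \varphi \mid \forall x\colon \varphi \mid \varphi \star \varphi \mid \varphi -\!\!\star \varphi , $$

where $\psi \in \mathfrak{A}$, and where $x \in \mathsf{Vars}$. $\triangle$ 


The Boolean connectives $\neg$, $\wedge$, and $\vee$ as well as the quantifiers $\exists$ and 
$\forall$ are standard. $\star$ is the separating conjunction and $-\!\!\star$ is the magic 
wand. 

The semantics $\llbracket \varphi \rrbracket \in \mathcal{P}(\mathsf{States})$ of a formula $\varphi \in \mathsf{SL}[\mathfrak{A}]$ is defined 
by induction on the structure of $\varphi$ as shown in Table 2. Recall that we 
assume the semantics $\llbracket \psi \rrbracket$ of predicate symbols $\psi \in \mathfrak{A}$ to be given. We 
often write $(s,h) \models \varphi$ instead of $(s,h) \in \llbracket \varphi \rrbracket$. For 
$\varphi, \psi \in \mathsf{SL}[\mathfrak{A}]$, we say that $\varphi$ entails $\psi$, denoted $\varphi \models \psi$, if 
for every $(s,h) \in \mathsf{States}$ with $(s,h) \models \varphi$, we also have $(s,h) \models \psi$. 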


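Example 1. Let $\mathsf{Vals} = \mathbb{Z}$, $\mathsf{Locs} = \mathbb{N}_{>0}$, and $k = 1$. A term $t$ is either a 
variable $x \in \mathsf{Vars}$ or the constant $0 \in \mathsf{Vals}$. The set $\mathfrak{A}$ of predicate 
symbols is 


$$ \mathfrak{A} = \{\, \mathsf{true},\ \mathsf{emp},\ x \mapsto t,\ t = t',\ t \neq t',\ \mathsf{ls}(t,t') \mid x \in \mathsf{Vars},\ t, t' \text{ terms} \,\} . $$

Here, apart from standard predicates for true, equalities, and disequalities, 
1. $\mathsf{emp}$ is the empty-heap predicate, i.e., 
$(s,h) \models \mathsf{emp}$ iff $\mathrm{dom}(h) = \emptyset$, 
2. $x \mapsto t$ is the points-to predicate, i.e., 


$(s,h) \models x \mapsto t$ iff $\mathrm{dom}(h) = \{s(x)\}$ and $h(s(x)) = t(s)$, 


3. the list predicate $\mathsf{ls}(t,t')$ asserts that the heap models a singly-linked 
list segment from $t$ to $t'$: 
$(s,h) \models \mathsf{ls}(t,t')$ 
iff $\mathrm{dom}(h) = \emptyset$ and $t(s) = t'(s)$, or 
there exist $n \geq 1$ and terms $t_1, \ldots, t_n$ with $t_n = t'$ such that 


$(s,h) \models t \mapsto t_1 \star \ldots \star t_{n-1} \mapsto t_n$. 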


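In this setting, $\mathsf{SL}[\mathfrak{A}]$ contains, e.g., the well-known symbolic heap fragment 
of separation logic with lists. For instance, the $\mathsf{SL}[\mathfrak{A}]$ formula 


$$ \exists y\colon \exists z\colon x \mapsto y \star y \mapsto z \star \mathsf{ls}(z, 0) $$


asserts that the heap consists of a list with head $x$ of length at least 2. $\triangle$ 

Table 3. Semantics of $\mathsf{QSL}[\mathfrak{A}]$ formulae. 


$f$                                     $\llbracket f \rrbracket (s,h)$ 
$[\psi]$                                $[\psi](s,h)$ 
$[\pi] \cdot g + [\neg\pi] \cdot u$     $[\pi](s,h) \cdot \llbracket g \rrbracket(s,h) + [\neg\pi](s,h) \cdot \llbracket u \rrbracket(s,h)$ 
$q \cdot g + (1-q) \cdot u$             $q \cdot \llbracket g \rrbracket(s,h) + (1-q) \cdot \llbracket u \rrbracket(s,h)$ 
$g \cdot u$                             $\llbracket g \rrbracket(s,h) \cdot \llbracket u \rrbracket(s,h)$ 
$1 - g$                                 $1 - \llbracket g \rrbracket(s,h)$ 
$g \max u$                              $\max\{ \llbracket g \rrbracket(s,h),\ \llbracket u \rrbracket(s,h) \}$ 
$g \min u$                              $\min\{ \llbracket g \rrbracket(s,h),\ \llbracket u \rrbracket(s,h) \}$ 
$\reflectbox{\textsf{S}}\, x\colon g$   $\max\{ \llbracket g \rrbracket(s[x := v], h) \mid v \in \mathsf{Vals} \}$ 
$\reflectbox{\textsf{J}}\, x\colon g$   $\min\{ \llbracket g \rrbracket(s[x := v], h) \mid v \in \mathsf{Vals} \}$ 
$g \star u$                             $\max\{ \llbracket g \rrbracket(s,h_1) \cdot \llbracket u \rrbracket(s,h_2) \mid h = h_1 \star h_2 \}$ 
$[\psi] -\!\!\star g$                   $\inf\{ \llbracket g \rrbracket(s, h \star h') \mid h' \perp h \text{ and } [\psi](s,h') = 1 \}$ 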


2.3 Quantitative Separation Logic 


In quantitative separation logic [9,39], formulae evaluate to non-negative real 
numbers or infinity instead of truth values. By conservatively extending the 
weakest preexpectation calculus by McIver & Morgan [41], this enables the 
compositional verification of probabilistic pointer programs by reasoning about 
expected list-sizes, probabilities of terminating with an empty heap, and the like. 

We consider here a fragment of quantitative separation logic suitable for rea- 
soning about the likelihood of events in probabilistic pointer programs such as, 
e.g., the probability of terminating in a given symbolic heap. The formulae we 
consider evaluate to rational probabilities rather than arbitrary reals or infinity. 
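We denote the set $[0,1] \cap \mathbb{Q}_{\geq 0}$ of rational probabilities by $\mathbb{P}$. Like 
$\mathsf{SL}[\mathfrak{A}]$, quantitative separation logic is parameterized by a set $\mathfrak{A}$ of 
predicate symbols $\psi$ with given semantics $\llbracket \psi \rrbracket \in \mathcal{P}(\mathsf{States})$, building the 
atoms of $\mathsf{QSL}[\mathfrak{A}]$. 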


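Definition 2. Let $\mathfrak{A}$ be a countable set of predicate symbols. Formulae in 
quantitative separation logic $\mathsf{QSL}[\mathfrak{A}]$ with atoms in $\mathfrak{A}$ adhere to the grammar 

$$ \begin{aligned} f \;\to\;\; & [\psi] \mid [\pi] \cdot f + [\neg\pi] \cdot f \mid q \cdot f + (1-q) \cdot f \mid f \cdot f \mid f \star f \mid [\psi] -\!\!\star f \\ \mid\; & 1 - f \mid f \max f \mid f \min f \mid \reflectbox{\textsf{S}}\, x\colon f \mid \reflectbox{\textsf{J}}\, x\colon f \end{aligned} $$

where $\psi, \pi \in \mathfrak{A}$ with $\pi$ pure, $q \in \mathbb{P}$, and where $x \in \mathsf{Vars}$. $\triangle$ 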


The semantics of a formula $f \in \mathsf{QSL}[\mathfrak{A}]$ is a (one-bounded) expectation. The 
set $\mathbb{E}_{\leq 1}$ of one-bounded expectations is defined as 


$$ \mathbb{E}_{\leq 1} = \{\, X \mid X\colon \mathsf{States} \to [0,1] \,\} . $$

We use the Iverson bracket [30] notation $[\Phi]$ to associate with predicate 
$\Phi$ its indicator function. Formally, 


$$ [\Phi]\colon \mathsf{States} \to \{0,1\}, \qquad [\Phi](s,h) = \begin{cases} 1, & \text{if } (s,h) \in \Phi \\ 0, & \text{if } (s,h) \notin \Phi . \end{cases} $$

Given a predicate symbol $\psi$, we often write $[\psi]$ instead of 
$[\llbracket \psi \rrbracket]$. The semantics $\llbracket f \rrbracket \in \mathbb{E}_{\leq 1}$ of $f \in \mathsf{QSL}[\mathfrak{A}]$ is defined by 
induction on the structure of $f$ in Table 3. We write $f \equiv g$ if $f$ and $g$ 
are equivalent, i.e., if $\llbracket f \rrbracket = \llbracket g \rrbracket$. Infima and suprema are taken over the 
complete lattice $([0,1], \leq)$. In particular, $\inf \emptyset = 1$ and 
$\sup \emptyset = 0$. 


Theorem 1. The semantics of $\mathsf{QSL}[\mathfrak{A}]$ formulae is well-defined, i.e., for all 
$f \in \mathsf{QSL}[\mathfrak{A}]$, we have $\llbracket f \rrbracket \in \mathbb{E}_{\leq 1}$. 


Proof. By induction on the structure of f. 


Let us go over the individual constructs. Formulae of the form $[\psi]$ are the 
atomic formulae. $[\pi] \cdot g + [\neg\pi] \cdot u$ is a Boolean choice between $g$ 
and $u$ that does not depend upon the heap since $\pi$ is pure. 
$q \cdot g + (1-q) \cdot u$ is a convex combination of $g$ and $u$. $g \cdot u$ is the 
pointwise multiplication of $g$ and $u$. $1 - g$ is the quantitative (or 
probabilistic) negation of $g$. $g \max u$ and $g \min u$ are the pointwise maximum 
and minimum of $g$ and $u$, respectively. 

$\reflectbox{\textsf{S}}\, x\colon g$ is the supremum quantification that, given a state 
$(s,h)$, evaluates to the supremum of the set obtained from evaluating $g$ in 
$(s[x := v], h)$ for every value $v \in \mathsf{Vals}$. In our setting, this supremum is 
actually a maximum. Dually, $\reflectbox{\textsf{J}}\, x\colon g$ is the infimum 
quantification. 

$\star$ and $-\!\!\star$ are the quantitative analogues of the separating conjunction 
and the magic wand from separation logic as defined in [9]. $g \star u$ is the 
quantitative separating conjunction of $g$ and $u$. Intuitively speaking, whereas 
the qualitative separating conjunction maximizes a truth value under all 
appropriate partitionings of the heap, the quantitative separating conjunction 
maximizes a probability. $[\psi] -\!\!\star u$ is the quantitative magic wand. 
Whereas the qualitative magic wand minimizes a truth value under all 
appropriate extensions of the heap, the quantitative magic wand minimizes a 
probability. For an in-depth treatment of these connectives, we refer to [9]. 
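
To make the maximization over heap partitionings tangible, the following Python
sketch evaluates a quantitative separating conjunction on small finite heaps.
It is an illustration under simplifying assumptions and not a decision
procedure: heaps are finite dictionaries with records of size k = 1, stacks are
finite dictionaries, and the helper names splits, sep_conj, iverson, and
points_to are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Iterator, Tuple

Stack = Dict[str, int]
Heap = Dict[int, int]  # location -> value (records of size k = 1)
Expectation = Callable[[Stack, Heap], float]


def splits(h: Heap) -> Iterator[Tuple[Heap, Heap]]:
    """Enumerate all pairs (h1, h2) of disjoint heaps whose union is h."""
    locs = list(h)
    for r in range(len(locs) + 1):
        for chosen in combinations(locs, r):
            h1 = {l: h[l] for l in chosen}
            h2 = {l: h[l] for l in locs if l not in chosen}
            yield h1, h2


def sep_conj(g: Expectation, u: Expectation) -> Expectation:
    """Quantitative separating conjunction: maximize g(h1) * u(h2)."""
    return lambda s, h: max(g(s, h1) * u(s, h2) for h1, h2 in splits(h))


def iverson(pred: Callable[[Stack, Heap], bool]) -> Expectation:
    """Indicator function of a predicate."""
    return lambda s, h: 1.0 if pred(s, h) else 0.0


emp = iverson(lambda s, h: len(h) == 0)


def points_to(x: str, y: str) -> Expectation:
    """[x -> y]: the heap is exactly the single cell s(x) holding s(y)."""
    return iverson(lambda s, h: h == {s[x]: s[y]})


s = {"x": 1, "y": 2, "z": 0}
h = {1: 2, 2: 0}
two_cells = sep_conj(points_to("x", "y"), points_to("y", "z"))
mix = lambda s, h: 0.7 * two_cells(s, h) + 0.3 * emp(s, h)
```

Here two_cells evaluates to 1 on h because the split into the cells at
locations 1 and 2 satisfies both points-to atoms, and to 0 on a heap with only
one cell. The convex combination mix then yields 0.7 on h and 0.3 on the empty
heap.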


Example 2. Let Vals, Locs, $k$, and $\mathfrak{A}$ be as in Example 1. Then $\mathsf{QSL}[\mathfrak{A}]$ 
contains, e.g., a quantitative extension of the symbolic heap fragment of 
separation logic with lists. For instance, the $\mathsf{QSL}[\mathfrak{A}]$ formula 


$$ 0.7 \cdot \big( \reflectbox{\textsf{S}}\, y\colon \reflectbox{\textsf{S}}\, z\colon [x \mapsto y] \star [y \mapsto z] \star [\mathsf{ls}(z, 0)] \big) + 0.3 \cdot [\mathsf{emp}] $$


expresses that with probability 0.7 the heap consists of a list with head $x$ of 
length at least 2 and that with probability 0.3 the heap is empty. $\triangle$ 


Finally, given $f, g \in \mathsf{QSL}[\mathfrak{A}]$, we say that $f$ entails $g$, denoted $f \models g$, if

\[ \text{for all } (s,h) \in \mathsf{States}\colon\quad \llbracket f\rrbracket(s,h) \leq \llbracket g\rrbracket(s,h). \]

Quantitative entailments $f \models g$ generalize classical entailments in the sense that
$f$ (pointwise) lower-bounds the quantity $g$. For example, if $g$ assigns to each
state the probability that some program $C$ terminates without a memory error,
then the entailment $[\mathsf{true}] \models g$ means that $C$ terminates almost-surely, i.e., with
probability one. Our problem statement now reads as follows: Reduce entailment
checking in $\mathsf{QSL}[\mathfrak{A}]$ to checking finitely many entailments in $\mathsf{SL}[\mathfrak{A}]$.


3 Entailments in Probabilistic Program Verification 


Our primary motivation for studying the entailment problem for quantitative
separation logic is to provide foundations for the automated verification of
probabilistic pointer programs. In this section, we consider examples of such programs
written in hpGCL, an extension of McIver & Morgan's probabilistic guarded
command language (cf. [41]) by heap-manipulating instructions, and the
entailments that arise from their verification. We briefly formalize reasoning about
hpGCL programs with weakest liberal preexpectations; for a thorough introduction
of hpGCL programs and techniques for their verification, we refer to [9,39].


3.1 Heap-manipulating pGCL 


Recall from Section 2.1 that heaps map memory locations to fixed-size records (or
tuples) of length $k \geq 1$. The set of programs in heap-manipulating probabilistic
guarded command language for $k = 1$, $\mathsf{Vals} = \mathbb{Z}$ and $\mathsf{Locs} = \mathbb{N}_{>0}$, denoted hpGCL,
is given by the grammar


C  →  skip                        (effectless program)
   |  x := E                      (assignment)
   |  { C } [p] { C }             (prob. choice)
   |  C ; C                       (seq. composition)
   |  if (B) { C } else { C }     (conditional choice)
   |  while (B) { C }             (loop)
   |  x := new(E)                 (allocation)
   |  free(E)                     (disposal)
   |  x := <E>                    (lookup)
   |  <E> := E'                   (mutation)


where $x \in \mathsf{Vars}$, $p \in \mathbb{P}$, $E, E'$ are arithmetic expressions and $B$ is a Boolean
expression. We assume that expressions do not depend on the heap. For now,
we do not fix a specific syntax for expressions but assume evaluation mappings

\[ E\colon \mathsf{Stacks} \to \mathbb{Z} \quad\text{and}\quad B\colon \mathsf{Stacks} \to \{\mathsf{true}, \mathsf{false}\}. \]


In addition to the usual control flow structures for sequential composition,
conditionals, and loops, skip does nothing, x := E assigns the value $E(s)$ obtained
from evaluating expression $E$ in the current program state $(s,h)$ to $x$, and the
probabilistic choice { C1 } [p] { C2 } flips a coin with bias $p$: it executes C1 if
the coin flip yields heads, and C2 otherwise. The allocation x := new(E)
nondeterministically selects a fresh location, stores it in $x$, and puts a record with
value $E$ on the heap at that location. Since we assume an infinite address space,
allocation never fails. Conversely, free(E) disposes the record at location $E$
from the heap; it fails if no such location exists. The mutation <E> := E' and
the lookup x := <E> update to $E'$ resp. assign to $x$ the value stored at location
$E$; both statements fail if the heap contains no such location.

Table 4. Rules for compositionally computing weakest liberal preexpectations. Here, $f$
is a $\mathsf{QSL}[\mathfrak{A}]$ formula representing the postexpectation. $f[x:=E]$ denotes the substitution
of every free occurrence of $x$ by $E$ in $f$. $[E \mapsto -]$ desugars to $\reflectbox{S}z\colon [E \mapsto z]$.

C                          $\mathsf{wlp}\llbracket C\rrbracket(f)$
skip                       $f$
x := E                     $f[x:=E]$
{ C1 } [p] { C2 }          $p\cdot\mathsf{wlp}\llbracket C_1\rrbracket(f) + (1-p)\cdot\mathsf{wlp}\llbracket C_2\rrbracket(f)$
C1 ; C2                    $\mathsf{wlp}\llbracket C_1\rrbracket(\mathsf{wlp}\llbracket C_2\rrbracket(f))$
if (B) { C1 } else { C2 }  $[B]\cdot\mathsf{wlp}\llbracket C_1\rrbracket(f) + [\neg B]\cdot\mathsf{wlp}\llbracket C_2\rrbracket(f)$
x := new(E)                $\reflectbox{J}y\colon [y \mapsto E] \mathrel{-\!\!\star} f[x:=y]$
free(E)                    $[E \mapsto -] \star f$
x := <E>                   $\reflectbox{S}y\colon [E \mapsto y] \star ([E \mapsto y] \mathrel{-\!\!\star} f[x:=y])$
<E> := E'                  $[E \mapsto -] \star ([E \mapsto E'] \mathrel{-\!\!\star} f)$


3.2 Weakest Liberal Preexpectations 


We formalize reasoning about hpGCL programs in terms of the weakest liberal
preexpectation transformer $\mathsf{wlp}\colon \mathsf{hpGCL} \to (\mathsf{QSL}[\mathfrak{A}] \to \mathsf{QSL}[\mathfrak{A}])$, where $\mathfrak{A}$ at
least contains formulae of the form $[E \mapsto E']$; Table 4 summarizes the rules for
computing wlp of loop-free programs on the program structure.

Conceptually, the weakest liberal preexpectation $\llbracket\mathsf{wlp}\llbracket C\rrbracket(f)\rrbracket(s,h)$ of
program $C$ with respect to postexpectation $f \in \mathsf{QSL}[\mathfrak{A}]$ on $(s,h)$ is the least
expected value of $\llbracket f\rrbracket$ (measured in the final states) after successful⁵ termination
of $C$ on initial state $(s,h)$, plus the probability that $C$ does not terminate on
$(s,h)$. Adding the non-termination probability can be thought of as a partial
correctness view: we include the non-termination probability of $C$ on state $(s,h)$ in
the wlp of $C$ just as we include the state $(s,h)$ in the weakest liberal precondition
of $C$ in case $C$ does not terminate on $(s,h)$.
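To make the heap-independent rules of Table 4 concrete, the following minimal sketch computes wlp for loop-free programs without heap access. It is only an illustration: the tuple encoding of programs, the function name `wlp`, and the representation of expectations as Python functions on stacks are assumptions made here and are not part of the paper.

```python
from fractions import Fraction

# Hypothetical program encoding (nested tuples), heap-free and loop-free only:
#   ("skip",), ("assign", x, E), ("choice", p, C1, C2),
#   ("seq", C1, C2), ("if", B, C1, C2)
# Expectations are functions from a stack (dict) to a Fraction in [0, 1].

def wlp(prog, post):
    """Apply the first five rules of Table 4 by recursion on the program."""
    kind = prog[0]
    if kind == "skip":                       # wlp[skip](f) = f
        return post
    if kind == "assign":                     # wlp[x := E](f) = f[x := E]
        _, var, expr = prog
        return lambda s: post({**s, var: expr(s)})
    if kind == "choice":                     # convex sum of both branches
        _, p, c1, c2 = prog
        w1, w2 = wlp(c1, post), wlp(c2, post)
        return lambda s: p * w1(s) + (1 - p) * w2(s)
    if kind == "seq":                        # wlp[C1](wlp[C2](f))
        _, c1, c2 = prog
        return wlp(c1, wlp(c2, post))
    if kind == "if":                         # [B]*wlp[C1](f) + [not B]*wlp[C2](f)
        _, guard, c1, c2 = prog
        w1, w2 = wlp(c1, post), wlp(c2, post)
        return lambda s: w1(s) if guard(s) else w2(s)
    raise ValueError(f"unsupported construct: {kind}")

# A fair coin flip into c, repeated once more if the first flip gave c = 0.
flip = ("choice", Fraction(1, 2),
        ("assign", "c", lambda s: 0),
        ("assign", "c", lambda s: 1))
two_flips = ("seq", flip, ("if", lambda s: s["c"] == 0, flip, ("skip",)))

post = lambda s: Fraction(1 if s["c"] == 1 else 0)   # the Iverson bracket [c = 1]
result = wlp(two_flips, post)({"c": 0})
```

Since the example program is loop-free, the non-termination summand is zero and `result` is simply the probability of ending with c = 1.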


5 i.e., without encountering a memory error.

A reader familiar with separation logic will realize the close similarity
between the rules in Table 4 and the weakest preconditions for SL by Ishtiaq and
O'Hearn [29]. The main differences are (1) the use of the quantitative
connectives $\star$, $\mathrel{-\!\!\star}$, $\cdot$, and $+$, and (2) the additional rule for probabilistic choice,
$\mathsf{wlp}\llbracket\{C_1\}\,[p]\,\{C_2\}\rrbracket(f)$, which is a convex sum that weights $\mathsf{wlp}\llbracket C_1\rrbracket(f)$ and
$\mathsf{wlp}\llbracket C_2\rrbracket(f)$ by $p$ and $(1-p)$, respectively.

The transformer wlp is well-defined in the sense that, for every loop-free
hpGCL program and every $\mathsf{QSL}[\mathfrak{A}]$ formula, we obtain, under mild conditions,
again a $\mathsf{QSL}[\mathfrak{A}]$ formula:


Theorem 2. Let $C \in \mathsf{hpGCL}$ be loop-free and $\mathfrak{A}$ be a set of predicate symbols. If

1. $\mathfrak{A}$ contains the points-to predicate for all variables and all expressions
occurring in allocation, disposal, lookup and mutation in $C$,

2. $\mathfrak{A}$ contains all guards and their negations occurring in $C$, and

3. all predicates in $\mathfrak{A}$ are closed under substitution of variables by variables and
arithmetic expressions occurring on right-hand sides of assignments in $C$,

then, for every $\mathsf{QSL}[\mathfrak{A}]$ formula $f$, $\mathsf{wlp}\llbracket C\rrbracket(f) \in \mathsf{QSL}[\mathfrak{A}]$.

Proof. By induction on loop-free $C$.


For loops, $\mathsf{wlp}\llbracket\mathtt{while}(B)\{C\}\rrbracket(f)$ is typically characterized as the greatest
fixed point of loop unrollings. However, we fixed an explicit syntax of formulae
instead of allowing arbitrary expectations; the above fixed point is in general
not expressible in our syntax.⁶ To deal with loops, we thus require a user-supplied
invariant $I$ and apply the following proof rule (cf. [34]) to approximate wlp:

\[ I \models [\neg B]\cdot f + [B]\cdot\mathsf{wlp}\llbracket C'\rrbracket(I) \quad\text{implies}\quad I \models \mathsf{wlp}\llbracket\mathtt{while}(B)\{C'\}\rrbracket(f) \]

Notice that verifying that $I$ is indeed an invariant via the above rule requires
proving an entailment between $\mathsf{QSL}[\mathfrak{A}]$ formulae.


3.3 Interfered Swap 


Our first example concerns a program $C_{\mathit{swap}}$, implemented in hpGCL below, that
attempts to swap the contents of two memory locations $x$ and $y$. However, since
variable $x$ is shared with a concurrently running process, writing to $x$ can be
unreliable, that is, instead of the intended value, the concurrently running process
may write a corrupted value err into memory with some probability, say 0.001.
A similar situation occurs, e.g., when using the protocol described in [2].

C_swap:  tmp1 := <x>;
         tmp2 := <y>;
         { <x> := tmp2 } [0.999] { <x> := err };
         <y> := tmp1.


6 It is noteworthy that a sufficiently expressive syntax for weakest preexpectation
reasoning without heaps has been developed only recently [8].

We can use wlp to verify an upper bound on the probability that an erroneous
write operation happened by solving the QSL entailment

\[ \mathsf{wlp}\llbracket C_{\mathit{swap}}\rrbracket([x \mapsto z_2] \star [y \mapsto z_1])
   \models [z_2 = \mathsf{err}]\cdot([x \mapsto z_1] \star [y \mapsto z_2]) + [z_2 \neq \mathsf{err}]\cdot(0.999\cdot([x \mapsto z_1] \star [y \mapsto z_2])). \]

That is, the probability that $C_{\mathit{swap}}$ successfully swaps the contents of $x$ and $y$ is
at most 0.999 if $y$ does not initially point to the corrupt value err.
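The bound can be cross-checked operationally by enumerating the two branches of the probabilistic choice. The sketch below is not the wlp calculation from the paper; it models the heap as a Python dictionary with the two cells named "x" and "y", and the names `swap_outcomes`, `success_probability`, and `ERR` are hypothetical.

```python
from fractions import Fraction

ERR = "err"  # hypothetical stand-in for the corrupted value

def swap_outcomes(heap, p_ok=Fraction(999, 1000)):
    """Enumerate the final heaps of C_swap together with their probabilities."""
    tmp1, tmp2 = heap["x"], heap["y"]          # tmp1 := <x>; tmp2 := <y>
    outcomes = []
    for prob, written in ((p_ok, tmp2), (1 - p_ok, ERR)):
        final = dict(heap)
        final["x"] = written                   # <x> := tmp2   or   <x> := err
        final["y"] = tmp1                      # <y> := tmp1
        outcomes.append((prob, final))
    return outcomes

def success_probability(z1, z2):
    """Probability that the cells initially holding z1 and z2 end up swapped."""
    heap = {"x": z1, "y": z2}
    return sum(p for p, final in swap_outcomes(heap) if final == {"x": z2, "y": z1})
```

For initial contents with z2 different from err, the success probability is exactly 0.999; if the cell at y already holds err, both branches write the same value and the swap succeeds with probability one, which matches the case distinction on $[z_2 = \mathsf{err}]$ in the entailment.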

As we will see in Section 6.1, our approach for solving QSL entailments is
capable of deciding the above entailment, where $\mathsf{wlp}\llbracket C_{\mathit{swap}}\rrbracket([x \mapsto z_2] \star [y \mapsto z_1])$
is computed according to the rules in Table 4.


3.4 Avoiding Magic Wands 


Recall from Table 4 that computing wlp introduces a magic wand ($\mathrel{-\!\!\star}$) for
almost every statement that accesses the heap. This is unfortunate because many
decidable separation logic fragments as well as practical entailment solvers do
not support magic wands.

In particular, in Section 6.1 we present a QSL fragment with a decidable
entailment problem that supports magic wands only on the left-hand side of
entailments. Hence, proving a lower bound on the probability that the program
$C_{\mathit{swap}}$ from above successfully swapped the contents of two memory cells, e.g.,

\[ 0.98\cdot([x \mapsto z_2] \star [y \mapsto z_1]) \models \mathsf{wlp}\llbracket C_{\mathit{swap}}\rrbracket([x \mapsto z_1] \star [y \mapsto z_2]), \qquad (\dagger) \]

might still be possible with our technique but requires a different separation
logic fragment to reduce to.

Fortunately, we can often avoid introducing magic wands by employing local
reasoning and rules for computing wlp for specific pre- and postexpectations.
In particular, the wlp calculus features (1) the frame rule from separation
logic, i.e., if no free variable in $g$ is modified by $C$, then
$\mathsf{wlp}\llbracket C\rrbracket(f) \star g \models \mathsf{wlp}\llbracket C\rrbracket(f \star g)$, (2) super-distributivity for convex combinations and
maximum, i.e., $q\cdot\mathsf{wlp}\llbracket C\rrbracket(f) + (1-q)\cdot\mathsf{wlp}\llbracket C\rrbracket(g) \models \mathsf{wlp}\llbracket C\rrbracket(q\cdot f + (1-q)\cdot g)$ and
$\mathsf{wlp}\llbracket C\rrbracket(f) \max \mathsf{wlp}\llbracket C\rrbracket(g) \models \mathsf{wlp}\llbracket C\rrbracket(f \max g)$, and (3) monotonicity, i.e., $f \models g$
implies $\mathsf{wlp}\llbracket C\rrbracket(f) \models \mathsf{wlp}\llbracket C\rrbracket(g)$. Moreover, we give four examples of specialized
rules that avoid magic wands but require specific postexpectations: if $x$ is not a
free variable of $E$ or $f$, and $x$ and $y$ are distinct variables, then

(i)   $\mathsf{wlp}\llbracket x := \langle E\rangle\rrbracket(([E \mapsto y]\cdot[x = y]) \star f) \equiv [E \mapsto y] \star f$ ;
(ii)  $\mathsf{wlp}\llbracket \langle E\rangle := E'\rrbracket([E \mapsto E'] \star f) \equiv [E \mapsto -] \star f$ ;
(iii) $\mathsf{wlp}\llbracket x := \mathtt{new}(x)\rrbracket(\reflectbox{S}y\colon [x \mapsto y] \star f) \equiv f[y:=x]$ ; and
(iv)  $\mathsf{wlp}\llbracket x := \mathtt{new}(y)\rrbracket([x \mapsto y] \star f) \equiv f$ .

Similar rules have been used successfully for symbolic execution with separation
logic in non-probabilistic settings [13]. Combining the above rules with framing,
distributivity, and monotonicity often allows avoiding magic wands. In such
cases, we have a richer set of decidable SL fragments upon which to build solvers
for QSL entailments at our disposal. Coming back to the entailment ($\dagger$) from
above and writing $C_{\mathit{swap}} = C_1; C_2; C_3; C_4$, we calculate


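\begin{align*}
 & \mathsf{wlp}\llbracket C_{\mathit{swap}}\rrbracket([x \mapsto z_1] \star [y \mapsto z_2]) \\
 \mathrel{=\!\!|}\ & \mathsf{wlp}\llbracket C_{\mathit{swap}}\rrbracket([y \mapsto \mathit{tmp1}] \star [x \mapsto \mathit{tmp2}]\cdot[\mathit{tmp1} = z_2]\cdot[\mathit{tmp2} = z_1]) && \text{(monotonicity)} \\
 \mathrel{=\!\!|}\ & \mathsf{wlp}\llbracket C_1; C_2; C_3\rrbracket(\mathsf{wlp}\llbracket C_4\rrbracket([y \mapsto \mathit{tmp1}]) \star ([x \mapsto \mathit{tmp2}]\cdot([\mathit{tmp1} = z_2]\cdot[\mathit{tmp2} = z_1]))) && \text{(framing)} \\
 \mathrel{=\!\!|}\ & \mathsf{wlp}\llbracket C_1; C_2; C_3\rrbracket([y \mapsto -] \star ([x \mapsto \mathit{tmp2}]\cdot([\mathit{tmp1} = z_2]\cdot[\mathit{tmp2} = z_1]))) && \text{(Rule (ii))} \\
 \mathrel{=\!\!|}\ & \mathsf{wlp}\llbracket C_1\rrbracket(0.999\cdot([y \mapsto z_1] \star ([\mathit{tmp1} = z_2]\cdot[x \mapsto -])) + 0.001\cdot[\mathsf{false}]) && \text{(Rule (i))} \\
 \mathrel{=\!\!|}\ & 0.999\cdot\mathsf{wlp}\llbracket C_1\rrbracket(([x \mapsto z_2]\cdot[\mathit{tmp1} = z_2]) \star [y \mapsto z_1]) + 0.001\cdot[\mathsf{false}] && \text{(super-distributivity, monotonicity and commutativity)} \\
 \mathrel{=\!\!|}\ & 0.999\cdot([x \mapsto z_2] \star [y \mapsto z_1]) + 0.001\cdot[\mathsf{false}] && \text{(Rule (i))}
\end{align*}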


which yields a preexpectation without magic wand. Hence, we obtain a magic
wand-free entailment in ($\dagger$). We have used our technique to transform this
quantitative entailment into several qualitative entailments and checked them
successfully using the separation logic extension of CVC4 [47]. Detailed calculations,
the resulting qualitative entailments, and the input for CVC4 in SMT-LIB 2
format are found in the extended version [7].


3.5 Randomized List Population 


Our second example populates a singly-linked list by flipping coins and adding
a list element until the coin flip yields heads, i.e., we consider the program

C_populate:  while (c ≠ 0) {
                 { c := 0 } [0.5] { x := new(x) }
             },

where $x$ is the head of a linked list. Assume we would like to determine a lower
bound on the probability that the above program does not crash and produces a
list of length at least two⁷. For that, recall from Example 1 the separation logic
formula $\mathsf{ls}(x,y)$ for singly-linked list segments. The aforementioned probability
is then given by $\mathsf{wlp}\llbracket C_{\mathit{populate}}\rrbracket(f)$ for postexpectation

\[ f = \reflectbox{S}y\colon \reflectbox{S}z\colon [x \mapsto y] \star [y \mapsto z] \star [\mathsf{ls}(z,0)]. \]


7 plus the probability of nontermination, which is 0.

We propose the loop invariant $I$ below to show that $I \models \mathsf{wlp}\llbracket C_{\mathit{populate}}\rrbracket(f)$, i.e.,
$I$ is a lower bound on the sought-after probability.

\[ I = \reflectbox{S}y\colon [x \mapsto y] \star \big([c = 0]\cdot\reflectbox{S}z\colon [y \mapsto z] \star [\mathsf{ls}(z,0)]
      + [c \neq 0]\cdot 1/2\cdot(\reflectbox{S}z\colon [y \mapsto z] \star [\mathsf{ls}(z,0)] + 1/2\cdot[\mathsf{ls}(z,0)])\big). \]

To verify that $I$ is indeed a loop invariant (hint: it is), we need to prove that

\[ I \models [c = 0]\cdot f + [c \neq 0]\cdot\mathsf{wlp}\llbracket\{c := 0\}\,[0.5]\,\{x := \mathtt{new}(x)\}\rrbracket(I). \]


As described in Section 3.4, we can compute wlp in a way such that the resulting 
formula contains no magic wands. Our reduction from QSL entailments to stan- 
dard SL entailments then allows us to discharge the above invariant check using 
existing separation logic solvers with support for fixed list predicates, e.g., [46]. 
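As a sanity check on the kind of probability being bounded here, the loop can be unrolled by hand: starting with c ≠ 0, the loop performs exactly k allocations with probability $(1/2)^{k+1}$. The sketch below sums these probabilities for a given initial list length. The function name and the truncation parameter are illustrative assumptions; this is not the invariant-based argument of the paper.

```python
from fractions import Fraction

def prob_length_at_least_two(initial_length, max_iterations=64):
    """Lower bound on the probability that C_populate (started with c != 0)
    ends with a list of length >= 2, obtained by unrolling the loop."""
    total = Fraction(0)
    for k in range(max_iterations):
        # k allocation branches followed by the exit branch c := 0
        if initial_length + k >= 2:
            total += Fraction(1, 2) ** (k + 1)
    return total

bound_from_one_cell = prob_length_at_least_two(1)
bound_from_two_cells = prob_length_at_least_two(2)
```

With a single initial cell the unrolled bounds approach 1/2 from below, and with two initial cells they approach 1, since any further allocation only lengthens the list.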


4 Quantitative Entailment Checking 


We present our main contribution of reducing entailment checking in $\mathsf{QSL}[\mathfrak{A}]$ to
entailment checking in $\mathsf{SL}[\mathfrak{A}]$. We consider the key observations leading to our
reduction in Section 4.1. We then deal with the formalization and more technical
considerations of our approach in Sections 4.2 and 4.3.


4.1 Idea and Key Observations

We reduce entailment checking in $\mathsf{QSL}[\mathfrak{A}]$ to entailment checking in $\mathsf{SL}[\mathfrak{A}]$, i.e.,

Given $f, g \in \mathsf{QSL}[\mathfrak{A}]$, we reduce checking $f \models g$ to checking finitely many
entailments of the form $\varphi \models \psi$ with $\varphi, \psi \in \mathsf{SL}[\mathfrak{A}]$.


We instantiate $\mathsf{QSL}[\mathfrak{A}]$ and $\mathsf{SL}[\mathfrak{A}]$, respectively, for the sake of concreteness. For
that, we fix the set $\mathfrak{A}$ of predicate symbols given by

\[ \mathfrak{A} = \{\mathsf{true},\ \mathsf{emp},\ x = y,\ x \neq y,\ x \mapsto y \mid x, y \in \mathsf{Vars}\}. \]

Now, consider the following entailment $u_1 \models u_2$ as a running example:

\[ u_1 = 0.4\cdot\underbrace{([x \mapsto y] \star [y \mapsto z])}_{g_1} + 0.6\cdot\underbrace{[x \mapsto y]}_{g_2} \ \models\ 0.6\cdot([x \mapsto y] \star [\mathsf{true}]) = u_2. \]

Intuitively speaking, $u_1$ expresses that with probability 0.4 the heap consists
of two cells where $x$ points to $y$ and separately $y$ points to $z$, and that with
probability 0.6 the heap consists of a single cell where $x$ points to $y$. Formula $u_2$
expresses that with probability 0.6 the heap contains a cell where $x$ points to $y$.
How can we reduce the problem of checking whether $u_1 \models u_2$ holds to checking
finitely many entailments in $\mathsf{SL}[\mathfrak{A}]$? We rely on two key observations:


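Observation 1. For every $f \in \mathsf{QSL}[\mathfrak{A}]$, the set

\[ \mathsf{Eval}(f) = \{\llbracket f\rrbracket(s,h) \mid (s,h) \in \mathsf{States}\} \subseteq \mathbb{P} \]

is finite. Moreover, there is an effectively constructible finite and sound
overapproximation $\mathsf{Val}\llbracket f\rrbracket$ of $\mathsf{Eval}(f)$, i.e., $\mathsf{Eval}(f) \subseteq \mathsf{Val}\llbracket f\rrbracket$.

Example 3. Consider the expectation $u_1$ from our running example: We have
$\mathsf{Eval}(u_1) = \{0, 0.4, 0.6\}$. We construct a finite overapproximation of $\mathsf{Eval}(u_1)$ as
follows: First, we observe that both subformulae $g_1$ and $g_2$ evaluate to a value
in $\{0,1\}$, i.e., $\mathsf{Val}\llbracket g_1\rrbracket = \mathsf{Val}\llbracket g_2\rrbracket = \{0,1\}$. From $\mathsf{Val}\llbracket g_1\rrbracket$ and $\mathsf{Val}\llbracket g_2\rrbracket$, we obtain a
finite overapproximation $\mathsf{Val}\llbracket u_1\rrbracket$ of $\mathsf{Eval}(u_1)$ given by

\[ \mathsf{Val}\llbracket u_1\rrbracket = \{0.4\cdot\alpha + 0.6\cdot\beta \mid \alpha \in \mathsf{Val}\llbracket g_1\rrbracket,\ \beta \in \mathsf{Val}\llbracket g_2\rrbracket\} = \{0, 0.4, 0.6, 1\}. \]

Notice that $\mathsf{Val}\llbracket u_1\rrbracket$ is a proper superset of $\mathsf{Eval}(u_1)$ since $1 \notin \mathsf{Eval}(u_1)$. △

We consider the construction of $\mathsf{Val}\llbracket f\rrbracket$ for arbitrary $f \in \mathsf{QSL}[\mathfrak{A}]$ in Section 4.2.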
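Observation 2. Given $f \in \mathsf{QSL}[\mathfrak{A}]$ and a probability $\alpha \in \mathbb{P}$, there is an
effectively constructible $\mathsf{SL}[\mathfrak{A}]$ formula, which we denote by $\lceil\alpha \preceq f\rceil$, such that $(s,h)$
is a model of $\lceil\alpha \preceq f\rceil$ if and only if $f$ evaluates at least to $\alpha$ on state $(s,h)$, i.e.,

\[ \underbrace{(s,h) \models \lceil\alpha \preceq f\rceil}_{\text{in } \mathsf{SL}[\mathfrak{A}]} \quad\text{iff}\quad \underbrace{\alpha \leq \llbracket f\rrbracket(s,h)}_{\text{in } \mathsf{QSL}[\mathfrak{A}]}. \]

We can thus lower bound $\mathsf{QSL}[\mathfrak{A}]$ formulae in terms of $\mathsf{SL}[\mathfrak{A}]$ formulae.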


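Example 4. Continuing our running example, we construct $\lceil 0.5 \preceq u_1\rceil$, i.e., an
$\mathsf{SL}[\mathfrak{A}]$ formula evaluating to true on state $(s,h)$ if and only if $u_1$ evaluates at
least to 0.5. We start by considering the subformulae of $u_1$. Since both $g_1$ and
$g_2$ embed $\mathsf{SL}[\mathfrak{A}]$ predicates, we have for every $\alpha \in \mathbb{P}$

\[ \lceil\alpha \preceq g_1\rceil = \mathsf{true} \text{ if } \alpha = 0 \text{ else } x \mapsto y \star y \mapsto z
   \quad\text{and}\quad \lceil\alpha \preceq g_2\rceil = \mathsf{true} \text{ if } \alpha = 0 \text{ else } x \mapsto y. \]

The intuition is as follows: $\alpha = 0$ lower bounds every probability. Conversely, if
$\alpha > 0$ then $\alpha$ lower bounds $g_1$ (resp. $g_2$) on state $(s,h)$ if and only if $(s,h)$ satisfies
the predicate $g_1$ (resp. $g_2$). Now, when does $u_1$ evaluate at least to 0.5? Given
$\mathsf{Val}\llbracket g_1\rrbracket$ and $\mathsf{Val}\llbracket g_2\rrbracket$ and the fact that the valuation of $u_1$ is a convex combination
of the valuations of $g_1$ and $g_2$, there are (at most) two cases: Either both $g_1$ and
$g_2$ evaluate to (at least) 1, or $g_2$ (but not necessarily $g_1$) evaluates to (at least) 1.
Given $\lceil 1 \preceq g_1\rceil$ and $\lceil 1 \preceq g_2\rceil$, the aforementioned informal disjunction translates
to a formal disjunction in $\mathsf{SL}[\mathfrak{A}]$:

\begin{align*}
\lceil 0.5 \preceq u_1\rceil &= (\lceil 1 \preceq g_1\rceil \wedge \lceil 1 \preceq g_2\rceil) \vee \lceil 1 \preceq g_2\rceil \\
 &= ((x \mapsto y \star y \mapsto z) \wedge x \mapsto y) \vee x \mapsto y.
\end{align*}

Notice that, as is the case for $\mathsf{Val}\llbracket u_1\rrbracket$, we construct $\lceil 0.5 \preceq u_1\rceil$ syntactically.
In particular, we disregard that the disjunct $(x \mapsto y \star y \mapsto z) \wedge x \mapsto y$ is
unsatisfiable and therefore equivalent to false. △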
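The two cases mentioned in Example 4 can be found mechanically by enumerating all pairs of candidate values. The following sketch does this for a convex combination; the function name `convex_cases` is a hypothetical helper, and exact rationals are used to avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import product

def convex_cases(q, val_g, val_u, alpha):
    """Pairs (beta, gamma) from the two value sets such that
    q * beta + (1 - q) * gamma >= alpha."""
    return sorted((beta, gamma)
                  for beta, gamma in product(val_g, val_u)
                  if q * beta + (1 - q) * gamma >= alpha)

zero_one = {Fraction(0), Fraction(1)}
# u_1 = 0.4 * g_1 + 0.6 * g_2 with threshold alpha = 0.5
cases = convex_cases(Fraction(2, 5), zero_one, zero_one, Fraction(1, 2))
```

Each resulting pair corresponds to one disjunct: the pair (1, 1) to the case where both $g_1$ and $g_2$ hold, and the pair (0, 1) to the case where only $g_2$ is required.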


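We provide the construction of $\lceil\alpha \preceq f\rceil$ for arbitrary $\mathsf{QSL}[\mathfrak{A}]$ formulae $f$,
including quantitative quantifiers and the magic wand, in Section 4.3.

Finally, Observations 1 and 2 together yield our reduction from $f \models g$ to
finitely many entailments in $\mathsf{SL}[\mathfrak{A}]$. Intuitively speaking, we formalize that

    whenever $f$ evaluates at least to $\alpha$, then $g$ too evaluates at least to $\alpha$

equivalently in terms of finitely many $\mathsf{SL}[\mathfrak{A}]$ entailments. Put more formally,
since $\mathsf{Val}\llbracket f\rrbracket$ is finite, we have

\begin{align*}
 & f \models g \\
 \text{iff}\ & \text{for all } (s,h)\colon\ \llbracket f\rrbracket(s,h) \leq \llbracket g\rrbracket(s,h) && \text{(by definition)} \\
 \text{iff}\ & \text{for all } (s,h) \text{ and all } \alpha \in \mathsf{Val}\llbracket f\rrbracket\colon\ \alpha \leq \llbracket f\rrbracket(s,h) \text{ implies } \alpha \leq \llbracket g\rrbracket(s,h) && \text{(by Observation 1)} \\
 \text{iff}\ & \text{for all } (s,h) \text{ and all } \alpha \in \mathsf{Val}\llbracket f\rrbracket\colon\ (s,h) \models \lceil\alpha \preceq f\rceil \text{ implies } (s,h) \models \lceil\alpha \preceq g\rceil && \text{(by Observation 2)} \\
 \text{iff}\ & \text{for all } \alpha \in \mathsf{Val}\llbracket f\rrbracket\colon\ \lceil\alpha \preceq f\rceil \models \lceil\alpha \preceq g\rceil. && \text{(by definition)}
\end{align*}

Table 5. Inductive definition of $\mathsf{Val}\llbracket f\rrbracket$.

$f \in \mathsf{QSL}[\mathfrak{A}]$                      $\mathsf{Val}\llbracket f\rrbracket \subseteq \mathbb{P}$
$[\varphi]$                              $\{0, 1\}$
$[\pi]\cdot g + [\neg\pi]\cdot u$        $\mathsf{Val}\llbracket g\rrbracket \cup \mathsf{Val}\llbracket u\rrbracket$
$p\cdot g + (1-p)\cdot u$                $p\cdot\mathsf{Val}\llbracket g\rrbracket + (1-p)\cdot\mathsf{Val}\llbracket u\rrbracket$
$g\cdot u$                               $\mathsf{Val}\llbracket g\rrbracket\cdot\mathsf{Val}\llbracket u\rrbracket$
$1-g$                                    $1 - \mathsf{Val}\llbracket g\rrbracket$
$g \max u$                               $\mathsf{Val}\llbracket g\rrbracket \max \mathsf{Val}\llbracket u\rrbracket$
$g \min u$                               $\mathsf{Val}\llbracket g\rrbracket \min \mathsf{Val}\llbracket u\rrbracket$
$\reflectbox{S}x\colon g$                $\mathsf{Val}\llbracket g\rrbracket$
$\reflectbox{J}x\colon g$                $\mathsf{Val}\llbracket g\rrbracket$
$g \star u$                              $\mathsf{Val}\llbracket g\rrbracket\cdot\mathsf{Val}\llbracket u\rrbracket$
$[\varphi] \mathrel{-\!\!\star} g$       $\mathsf{Val}\llbracket g\rrbracket$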
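Example 5. Reconsider our running example. Since $|\mathsf{Val}\llbracket u_1\rrbracket| = 4$, the $\mathsf{QSL}[\mathfrak{A}]$
entailment $u_1 \models u_2$ is equivalent to the four entailments

\[ \lceil\alpha \preceq u_1\rceil \models \lceil\alpha \preceq u_2\rceil \quad\text{for } \alpha \in \{0, 0.4, 0.6, 1\} \]

in $\mathsf{SL}[\mathfrak{A}]$, each of which actually holds.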
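The equivalence between the pointwise check and the threshold-wise check can be observed on a small finite model of the running example. The sketch below enumerates all stacks and heaps over a three-element value domain and evaluates $u_1$ and $u_2$ semantically. It does not construct the $\mathsf{SL}[\mathfrak{A}]$ formulae syntactically; the bounded domain, the dictionary representation of heaps, and all function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

VALS = (0, 1, 2)  # small value domain; 0 plays the role of null

def heaps():
    """All heaps whose locations are the positive values of VALS."""
    locs = [v for v in VALS if v > 0]
    for cells in product((None,) + VALS, repeat=len(locs)):
        yield {l: c for l, c in zip(locs, cells) if c is not None}

def states():
    for x, y, z in product(VALS, repeat=3):
        for h in heaps():
            yield {"x": x, "y": y, "z": z}, h

def g1(s, h):  # [x |-> y] * [y |-> z]: exactly two distinct cells
    return s["x"] != s["y"] and h == {s["x"]: s["y"], s["y"]: s["z"]}

def g2(s, h):  # [x |-> y]: exactly one cell
    return h == {s["x"]: s["y"]}

def u1(s, h):
    return Fraction(2, 5) * g1(s, h) + Fraction(3, 5) * g2(s, h)

def u2(s, h):  # 0.6 * ([x |-> y] * [true]): some cell at x holding y
    return Fraction(3, 5) * (h.get(s["x"]) == s["y"])

VAL_U1 = {Fraction(0), Fraction(2, 5), Fraction(3, 5), Fraction(1)}

pointwise = all(u1(s, h) <= u2(s, h) for s, h in states())
thresholds = all(u2(s, h) >= a
                 for a in VAL_U1
                 for s, h in states() if u1(s, h) >= a)
observed_values = {u1(s, h) for s, h in states()}
```

On this bounded model both checks succeed, and the values actually attained by $u_1$ form a proper subset of the four candidate thresholds.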


4.2 Constructing Finite Overapproximations of Eval (f) 


We consider the formal construction underlying Observation 1 from the previous
section, i.e., given $f \in \mathsf{QSL}[\mathfrak{A}]$, we provide a syntactic construction of a finite
overapproximation $\mathsf{Val}\llbracket f\rrbracket$ of $\mathsf{Eval}(f)$. This construction is by induction on the
structure of $f$ as shown in Table 5. For that, we define some shorthands. Given
$\alpha \in \mathbb{P}$, $V, W \subseteq \mathbb{P}$, and a binary operation $\circ\colon \mathbb{P} \times \mathbb{P} \to \mathbb{P}$, we define

\[ \alpha\cdot V = \{\alpha\cdot\beta \mid \beta \in V\} \quad\text{and}\quad V \circ W = \{\beta \circ \gamma \mid \beta \in V,\ \gamma \in W\}. \]


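Let us now go over the individual cases.

The case $f = [\varphi]$. We have $\llbracket[\varphi]\rrbracket(s,h) \in \{0,1\}$ by definition.

The case $f = [\pi]\cdot g + [\neg\pi]\cdot u$. For every $(s,h)$, the formula $f$ either evaluates
to $\llbracket g\rrbracket(s,h)$ or to $\llbracket u\rrbracket(s,h)$, depending on whether $(s,h) \models \pi$ holds.

The case $f = p\cdot g + (1-p)\cdot u$. The formula $f$ evaluates to $p\cdot\alpha + (1-p)\cdot\beta$ for
some $\alpha \in \mathsf{Val}\llbracket g\rrbracket$ and $\beta \in \mathsf{Val}\llbracket u\rrbracket$.

The case $f = g\cdot u$ or $f = g \star u$. The formula $f$ evaluates to $\alpha\cdot\beta$ for some
$\alpha \in \mathsf{Val}\llbracket g\rrbracket$ and $\beta \in \mathsf{Val}\llbracket u\rrbracket$.

The case $f = 1 - g$. The formula $f$ evaluates to $1 - \alpha$ for some $\alpha \in \mathsf{Val}\llbracket g\rrbracket$.

The case $f = g \circ u$ for $\circ \in \{\max, \min\}$. Since max and min are defined pointwise,
the formula $f$ evaluates to some value $\alpha \circ \beta$ for $\alpha \in \mathsf{Val}\llbracket g\rrbracket$, $\beta \in \mathsf{Val}\llbracket u\rrbracket$.

The case $f = \reflectbox{S}x\colon g$ or $f = \reflectbox{J}x\colon g$. Since $\mathsf{Val}\llbracket g\rrbracket$ overapproximates the set of
all valuations of $g$, quantitative quantifiers do not add any valuation.

The case $f = [\varphi] \mathrel{-\!\!\star} g$. Recall that

\[ \llbracket f\rrbracket(s,h) = \inf\{\llbracket g\rrbracket(s, h \star h') \mid h' \perp h \text{ and } \llbracket[\varphi]\rrbracket(s,h') = 1\}. \]

If the above set is non-empty, the infimum is actually a minimum and therefore
$f$ evaluates to some value in $\mathsf{Val}\llbracket g\rrbracket$. If the above set is empty, then $\llbracket f\rrbracket(s,h) = 1$.
It is easy to verify that 1 is necessarily an element of $\mathsf{Val}\llbracket g\rrbracket$ (cf. [7, Lemma 4]).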
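Summarizing our considerations on $\mathsf{Val}\llbracket f\rrbracket$, we get:

Theorem 3. For every $f \in \mathsf{QSL}[\mathfrak{A}]$, the effectively constructible set $\mathsf{Val}\llbracket f\rrbracket \subseteq \mathbb{P}$
given in Table 5 satisfies

\[ |\mathsf{Val}\llbracket f\rrbracket| < \infty \quad\text{and}\quad \mathsf{Eval}(f) \subseteq \mathsf{Val}\llbracket f\rrbracket. \]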


Proof. Straightforward by induction on f. 
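The inductive construction of Table 5 translates directly into a recursive function over formula syntax trees. The sketch below uses a hypothetical tuple encoding of formulae (the tags "atom", "ite", "convex", and so on are chosen here for illustration) and exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def val(f):
    """Compute the candidate value set of f, one case per row of Table 5."""
    op = f[0]
    if op == "atom":                         # [phi]
        return {Fraction(0), Fraction(1)}
    if op == "ite":                          # ("ite", pi, g, u): [pi]*g + [not pi]*u
        return val(f[2]) | val(f[3])
    if op == "convex":                       # ("convex", p, g, u): p*g + (1-p)*u
        p = f[1]
        return {p * a + (1 - p) * b for a, b in product(val(f[2]), val(f[3]))}
    if op in ("times", "sepcon"):            # g * u and the separating conjunction
        return {a * b for a, b in product(val(f[1]), val(f[2]))}
    if op == "neg":                          # 1 - g
        return {1 - a for a in val(f[1])}
    if op == "max":
        return {max(a, b) for a, b in product(val(f[1]), val(f[2]))}
    if op == "min":
        return {min(a, b) for a, b in product(val(f[1]), val(f[2]))}
    if op in ("sup", "inf"):                 # ("sup", x, g) / ("inf", x, g)
        return val(f[2])
    if op == "wand":                         # ("wand", phi, g)
        return val(f[2])
    raise ValueError(f"unknown construct: {op}")

# Running example u_1 = 0.4 * ([x |-> y] * [y |-> z]) + 0.6 * [x |-> y]
g1 = ("sepcon", ("atom", "x |-> y"), ("atom", "y |-> z"))
g2 = ("atom", "x |-> y")
u1 = ("convex", Fraction(2, 5), g1, g2)
val_u1 = val(u1)
```

For the running example this reproduces the four candidate values from Example 3. The recursion only combines finite sets with finitely many operations, which mirrors why the result is always finite.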


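4.3 Lower Bounding $\mathsf{QSL}[\mathfrak{A}]$ by $\mathsf{SL}[\mathfrak{A}]$ Formulae


We now consider the formal construction underlying Observation 2 from
Section 4.1. That is, given $f \in \mathsf{QSL}[\mathfrak{A}]$ and $\alpha \in \mathbb{P}$, we provide the syntactic
construction of an $\mathsf{SL}[\mathfrak{A}]$ formula $\lceil\alpha \preceq f\rceil$ evaluating to true on state $(s,h)$ if and
only if $f$ evaluates at least to $\alpha$ on $(s,h)$. This construction relies on $\mathsf{Val}\llbracket f\rrbracket$ from
the previous section and is given by induction on the structure of $f$ as shown in
Table 6. We consider the individual constructs. For that, we fix some state $(s,h)$.

Table 6. Inductive definition of $\lceil\alpha \preceq f\rceil$ for a given $\alpha \in \mathbb{P}$.

$f \in \mathsf{QSL}[\mathfrak{A}]$                      $\lceil\alpha \preceq f\rceil \in \mathsf{SL}[\mathfrak{A}]$
$[\varphi]$                              $\mathsf{true}$ if $\alpha = 0$ else $\varphi$
$[\pi]\cdot g + [\neg\pi]\cdot u$        $(\pi \wedge \lceil\alpha \preceq g\rceil) \vee (\neg\pi \wedge \lceil\alpha \preceq u\rceil)$
$q\cdot g + (1-q)\cdot u$                $\bigvee_{\beta \in \mathsf{Val}\llbracket g\rrbracket,\ \gamma \in \mathsf{Val}\llbracket u\rrbracket,\ q\cdot\beta + (1-q)\cdot\gamma \geq \alpha} \lceil\beta \preceq g\rceil \wedge \lceil\gamma \preceq u\rceil$
$g\cdot u$                               $\bigvee_{\beta \in \mathsf{Val}\llbracket g\rrbracket,\ \gamma \in \mathsf{Val}\llbracket u\rrbracket,\ \beta\cdot\gamma \geq \alpha} \lceil\beta \preceq g\rceil \wedge \lceil\gamma \preceq u\rceil$
$1-g$                                    $\mathsf{true}$ if $\alpha = 0$ else $\neg\lceil\delta \preceq g\rceil$
                                         for $\delta = \min\{\beta \in \mathsf{Val}\llbracket g\rrbracket \mid \beta > 1-\alpha\}$
$g \max u$                               $\lceil\alpha \preceq g\rceil \vee \lceil\alpha \preceq u\rceil$
$g \min u$                               $\lceil\alpha \preceq g\rceil \wedge \lceil\alpha \preceq u\rceil$
$\reflectbox{S}x\colon g$                $\exists x\colon \lceil\alpha \preceq g\rceil$
$\reflectbox{J}x\colon g$                $\forall x\colon \lceil\alpha \preceq g\rceil$
$g \star u$                              $\bigvee_{\beta \in \mathsf{Val}\llbracket g\rrbracket,\ \gamma \in \mathsf{Val}\llbracket u\rrbracket,\ \beta\cdot\gamma \geq \alpha} \lceil\beta \preceq g\rceil \star \lceil\gamma \preceq u\rceil$
$[\varphi] \mathrel{-\!\!\star} g$       $\varphi \mathrel{-\!\!\star} \lceil\alpha \preceq g\rceil$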


The case $f = [\varphi]$. There are two cases. If $\alpha = 0$, then $\alpha$ trivially lower bounds
the value of $[\varphi]$. Conversely, if $\alpha > 0$, then $\alpha$ lower bounds $[\varphi]$ on state $(s,h)$ if
and only if $(s,h)$ satisfies $\varphi$.

For the composite cases, recall that by Theorem 3 there are effectively
constructible finite sets $\mathsf{Val}\llbracket g\rrbracket$, $\mathsf{Val}\llbracket u\rrbracket$ covering all values $g$ and $u$ evaluate to.

The case $f = [\pi]\cdot g + [\neg\pi]\cdot u$. The formula $f$ represents a Boolean choice between
the formulae $g$ and $u$, depending on the truth value of $\pi$. Hence, there are two
cases: If $(s,h)$ does satisfy $\pi$, then $\alpha$ lower bounds $f$ iff $\alpha$ lower bounds $g$.
Conversely, if $(s,h)$ does not satisfy $\pi$, then $\alpha$ lower bounds $f$ iff $\alpha$ lower bounds $u$.

The case $f = p\cdot g + (1-p)\cdot u$. Since $\mathsf{Val}\llbracket g\rrbracket$ and $\mathsf{Val}\llbracket u\rrbracket$ cover every possible
valuation of $g$ and $u$, respectively, it follows that $\alpha$ lower bounds the valuation
of $f$ if and only if there are $\beta \in \mathsf{Val}\llbracket g\rrbracket$ and $\gamma \in \mathsf{Val}\llbracket u\rrbracket$ such that (1) $\beta$
lower bounds $g$, (2) $\gamma$ lower bounds $u$, and (3) $\alpha$ lower bounds the convex sum
$p\cdot\beta + (1-p)\cdot\gamma$.


The case $f = g\cdot u$. The reasoning is analogous to the previous case.

The case $f = 1 - g$. We write $\alpha \leq \llbracket 1-g\rrbracket(s,h)$ equivalently as $\neg(1-\alpha < \llbracket g\rrbracket(s,h))$.
In order to turn the strict inequality into a non-strict one, we consider
the successor $\delta$ of $1-\alpha$ in $\mathsf{Val}\llbracket g\rrbracket$, i.e., the smallest $\delta$ in $\mathsf{Val}\llbracket g\rrbracket$ greater than
$1-\alpha$. Since $\mathsf{Val}\llbracket g\rrbracket$ is finite, such a $\delta$ always exists if $1-\alpha \neq 1$. We illustrate the
idea in the following picture, where all elements in $\mathsf{Val}\llbracket g\rrbracket$ are marked by •.

[Picture: number line from 0 to 1 with the elements of $\mathsf{Val}\llbracket g\rrbracket$ marked; from left to right it shows $1-\alpha$, $\delta$, and $\llbracket g\rrbracket(s,h)$.]

For the successor $\delta$, checking if $\delta$ is a lower bound of $\llbracket g\rrbracket(s,h)$ is equivalent to
checking if $1-\alpha$ is a strict lower bound: if $\delta$ is not a lower bound, then we ran
out of possible valuations that are strictly lower bounded by $1-\alpha$.
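The successor trick for the negation case can be phrased as a small function. The sketch below assumes that the value of $g$ on the current state is itself an element of the finite candidate set, which is what Theorem 3 guarantees; the function names are illustrative assumptions.

```python
from fractions import Fraction

def successor(val_g, alpha):
    """Smallest delta in val_g with delta > 1 - alpha, or None if none exists."""
    above = [beta for beta in val_g if beta > 1 - alpha]
    return min(above) if above else None

def lower_bounds_negation(value_g, val_g, alpha):
    """Decide alpha <= 1 - value_g using only a non-strict test against delta."""
    if alpha == 0:
        return True                      # 0 lower bounds every probability
    delta = successor(val_g, alpha)      # exists because 1 is in val_g and alpha > 0
    return not (delta <= value_g)

VAL_G = {Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(1)}
ALPHAS = [Fraction(k, 8) for k in range(9)]
agrees = all(lower_bounds_negation(v, VAL_G, a) == (a <= 1 - v)
             for v in VAL_G for a in ALPHAS)
```

The comparison against the direct inequality over all candidate values and a grid of thresholds illustrates why replacing the strict bound $1-\alpha$ by the non-strict bound $\delta$ is harmless on a finite value set.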


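The case $f = g \circ u$ for $\circ \in \{\max, \min\}$. The probability $\alpha$ lower bounds the
maximum of $g$ and $u$ on state $(s,h)$ if and only if $\alpha$ lower bounds $g$ or $\alpha$ lower
bounds $u$. For $\circ = \min$, the reasoning is dual.

The case $f = \reflectbox{S}x\colon g$. Recall that

\[ \llbracket f\rrbracket(s,h) = \max\{\llbracket g\rrbracket(s[x:=v],h) \mid v \in \mathsf{Vals}\}. \]

Now observe that $\alpha$ lower bounds the above maximum if and only if $\alpha$ lower
bounds some element of the above set, i.e., if and only if there is some $v$ with
$\alpha \leq \llbracket g\rrbracket(s[x:=v],h)$, which is equivalent to $(s,h) \models \exists x\colon \lceil\alpha \preceq g\rceil$.

The case $f = \reflectbox{J}x\colon g$. Recall that

\[ \llbracket f\rrbracket(s,h) = \min\{\llbracket g\rrbracket(s[x:=v],h) \mid v \in \mathsf{Vals}\}. \]

Since $\alpha$ lower bounds the above minimum if and only if $\alpha$ lower bounds all
elements of the above set, the reasoning is dual to the previous case.

The case $f = g \star u$. Recall that

\[ \llbracket f\rrbracket(s,h) = \max\{\llbracket g\rrbracket(s,h_1)\cdot\llbracket u\rrbracket(s,h_2) \mid h = h_1 \star h_2\}. \]

Since $\mathsf{Val}\llbracket g\rrbracket$ and $\mathsf{Val}\llbracket u\rrbracket$ cover every possible valuation of $g$ and $u$, respectively,
$\alpha$ lower bounds the evaluation of $f$ on $(s,h)$ iff there are $\beta \in \mathsf{Val}\llbracket g\rrbracket$, $\gamma \in \mathsf{Val}\llbracket u\rrbracket$
and $h_1, h_2$ with $h_1 \star h_2 = h$ such that (1) $\beta$ lower bounds $g$ on $(s,h_1)$, (2) $\gamma$ lower
bounds $u$ on $(s,h_2)$, and (3) $\alpha$ lower bounds $\beta\cdot\gamma$. Given such $\beta$ and $\gamma$, we can
phrase this equivalently in $\mathsf{SL}[\mathfrak{A}]$ as

\[ (s,h) \models \lceil\beta \preceq g\rceil \star \lceil\gamma \preceq u\rceil. \]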


The case $f = [\varphi] \mathrel{-\!\!\star} g$. Recall that

\[ \llbracket f\rrbracket(s,h) = \inf\{\llbracket g\rrbracket(s, h \star h') \mid h' \perp h \text{ and } \llbracket[\varphi]\rrbracket(s,h') = 1\}. \]

Probability $\alpha$ lower bounds the above infimum if and only if for every extension
$h'$ of the heap $h$ such that the stack $s$ together with $h'$ satisfy $\varphi$, probability $\alpha$
is a lower bound on $\llbracket g\rrbracket(s, h \star h')$. Put more formally, the latter statement reads

\[ \text{for all } h' \perp h \text{ with } (s,h') \models \varphi\colon\quad (s, h \star h') \models \lceil\alpha \preceq g\rceil, \]

which is equivalent to $(s,h) \models \varphi \mathrel{-\!\!\star} \lceil\alpha \preceq g\rceil$.


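Our construction thus applies to arbitrary $\mathsf{QSL}[\mathfrak{A}]$ formulae and we get:

Theorem 4. For every $f \in \mathsf{QSL}[\mathfrak{A}]$ and all $\alpha \in \mathbb{P}$ there is an effectively
constructible $\mathsf{SL}[\mathfrak{A}]$ formula $\lceil\alpha \preceq f\rceil$ such that for all $(s,h) \in \mathsf{States}$, we have

\[ (s,h) \models \lceil\alpha \preceq f\rceil \quad\text{iff}\quad \alpha \leq \llbracket f\rrbracket(s,h). \]

Proof. By induction on $f$.

Finally, we obtain our main theorem.

Theorem 5. Entailment checking in $\mathsf{QSL}[\mathfrak{A}]$ reduces to entailment checking in
$\mathsf{SL}[\mathfrak{A}]$, i.e., for all $f, g \in \mathsf{QSL}[\mathfrak{A}]$, we have

\[ f \models g \quad\text{iff}\quad \text{for all } \alpha \in \mathsf{Val}\llbracket f\rrbracket\colon\ \lceil\alpha \preceq f\rceil \models \lceil\alpha \preceq g\rceil. \]

Proof. Follows from Theorems 3 and 4 and the reasoning at the end of Section 4.1.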


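Remark 1 (Avoiding true in $\mathsf{SL}[\mathfrak{A}]$ entailments). Formulae of the form
$\lceil\alpha \preceq f\rceil \in \mathsf{SL}[\mathfrak{A}]$ may introduce the atom true, which is not admitted by some
decidable separation logic fragments, such as [27]. Fortunately, we can avoid true in
$\lceil\alpha \preceq f\rceil$ formulae. true is only required in formulae of the form $\lceil 0 \preceq f\rceil$, which
arise in two situations when applying Theorem 5: (1) in entailment checks of the
form $\lceil 0 \preceq f\rceil \models \lceil 0 \preceq g\rceil$, which always hold and can thus be omitted, and (2) if
$f = p\cdot g + (1-p)\cdot u$. In the latter case, if we have $\alpha \neq 0$ in

\[ \lceil\alpha \preceq f\rceil = \bigvee_{\beta \in \mathsf{Val}\llbracket g\rrbracket,\ \gamma \in \mathsf{Val}\llbracket u\rrbracket,\ p\cdot\beta + (1-p)\cdot\gamma \geq \alpha} \lceil\beta \preceq g\rceil \wedge \lceil\gamma \preceq u\rceil, \]

then either $\beta \neq 0$ or $\gamma \neq 0$ holds for every disjunct. Hence, subformulae of the
form $\lceil 0 \preceq g\rceil$ or $\lceil 0 \preceq u\rceil$ can be omitted, as well. △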


5 Complexity 


We now analyze the complexity of our approach. Recall that Theorem 5 reduces
checking $f \models g$ in $\mathsf{QSL}[\mathfrak{A}]$ to checking

\[ \text{for all } \alpha \in \mathsf{Val}\llbracket f\rrbracket\colon\quad \lceil\alpha \preceq f\rceil \models \lceil\alpha \preceq g\rceil \]

in $\mathsf{SL}[\mathfrak{A}]$. We consider two aspects: (1) the number of $\mathsf{SL}[\mathfrak{A}]$ entailments and
(2) the size of the resulting $\mathsf{SL}[\mathfrak{A}]$ formulae occurring in each entailment. We
express these quantities in terms of the size of a $\mathsf{QSL}[\mathfrak{A}]$ formula $f$ and an $\mathsf{SL}[\mathfrak{A}]$
formula $\varphi$ and denote them as $|f|$ and $|\varphi|$, respectively. In these sizes, we count
every construct in the formula and require that the sizes of atoms are defined
at instantiation. Moreover, we assume that every atom in $\mathfrak{A}$ is at least of size
1 and, in particular, that the atom true is of size 1. Additionally, we count in a $\mathsf{QSL}[\mathfrak{A}]$
formula $f$ the constructs that increase the number of possible evaluation results
of $f$, namely $q\cdot g + (1-q)\cdot u$, $g\cdot u$ and $g \star u$, and denote this count as $|f|_p$.⁸

We will see that for an entailment $f \models g$ in $\mathsf{QSL}[\mathfrak{A}]$, (1) the number of $\mathsf{SL}[\mathfrak{A}]$
entailments is in $2^{O(|f|_p)}$ in the worst case (see Theorem 6) and (2) the sizes of the
resulting $\mathsf{SL}[\mathfrak{A}]$ formulae are in $O(|f|)\cdot 2^{O(|f|_p^2)}$ and $O(|g|)\cdot 2^{O(|g|_p^2)}$, respectively, in
the worst case (see Theorem 7). Now let us assume we have an entailment checker
for $\mathsf{SL}[\mathfrak{A}]$ formulae that can solve entailments of the form $\lceil\alpha \preceq f\rceil \models \lceil\alpha \preceq g\rceil$
and which has a runtime complexity of SL-Time$(n, m)$ where $n$ and $m$ are the
sizes of the $\mathsf{SL}[\mathfrak{A}]$ formulae on the left and right side of an entailment, respectively.
Putting the above together, checking the entailment $f \models g$ in $\mathsf{QSL}[\mathfrak{A}]$ then has
a runtime complexity of

\[ 2^{O(|f|_p)} \cdot \text{SL-Time}\big(O(|f|)\cdot 2^{O(|f|_p^2)},\ O(|g|)\cdot 2^{O(|g|_p^2)}\big)
   + O(|f|)\cdot 2^{O(|f|_p^2)} + O(|g|)\cdot 2^{O(|g|_p^2)}. \]

If we furthermore reasonably assume that SL-Time$(n, m)$ is at least linear in both
arguments (otherwise the entailment checker can only check trivial entailments
anyway), the runtime complexity simplifies to

\[ 2^{O(|f|_p)} \cdot \text{SL-Time}\big(O(|f|)\cdot 2^{O(|f|_p^2)},\ O(|g|)\cdot 2^{O(|g|_p^2)}\big). \]

As for aspect (1), we first observe that checking $f \models g$ by means of Theorem 5
requires checking $|\mathsf{Val}\llbracket f\rrbracket|$ entailments in $\mathsf{SL}[\mathfrak{A}]$. However, only the constructs we
count with $|f|_p$ increase the number of possible evaluations, which in turn will
also increase the size of the overapproximation $\mathsf{Val}\llbracket f\rrbracket$. Every time any of these
constructs occurs, the number of possible evaluations $\mathsf{Eval}(f)$ may double.
Consequently, the overapproximation $\mathsf{Val}\llbracket f\rrbracket$ also doubles in size when any of these
constructs occurs. Other constructs do not increase the number of evaluations,
but instead inherit the evaluations from their subformulae.


Theorem 6. We have $|\mathsf{Val}\llbracket f\rrbracket| \leq 2^{|f|_p + 1}$. Hence, checking $f \models g$ by means of
Theorem 5 requires checking $2^{O(|f|_p)}$ entailments in $\mathsf{SL}[\mathfrak{A}]$.


Proof. By induction on f. 
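The doubling behaviour behind Theorem 6 can be observed on a simple family of formulae. The sketch below nests n convex combinations over a single atom and counts the resulting candidate values; the family, the weight 1/3, and the function name `val_chain` are assumptions chosen for illustration.

```python
from fractions import Fraction

def val_chain(weights):
    """Candidate values of q_1*[a] + (1-q_1)*(q_2*[a] + (1-q_2)*(... [a])),
    computed innermost-first following the convex-combination row of Table 5."""
    values = {Fraction(0), Fraction(1)}          # innermost atom
    for q in reversed(weights):
        values = {q * a + (1 - q) * b
                  for a in (Fraction(0), Fraction(1))
                  for b in values}
    return values

third = Fraction(1, 3)
sizes = [len(val_chain([third] * n)) for n in range(7)]
within_bound = all(size <= 2 ** (n + 1) for n, size in enumerate(sizes))
```

A formula with n nested convex combinations has $|f|_p = n$, so the observed set sizes stay within $2^{n+1}$, and for small n the bound is attained exactly.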


For the size of the resulting $\mathsf{SL}[\mathfrak{A}]$ formulae, i.e., aspect (2), recall that we
construct entailments of the form

\[ \lceil\alpha \preceq f\rceil \models \lceil\alpha \preceq g\rceil. \]

We thus determine an upper bound on the size of any $\mathsf{SL}[\mathfrak{A}]$ formula $\lceil\alpha \preceq f\rceil$.
Here we make a similar observation as in aspect (1): whenever one of the
constructs we count with $|f|_p$ appears, the size of the formula increases by the
exponential factor $|\mathsf{Val}\llbracket f\rrbracket|$. Such a multiplication of increasing exponential
expressions then results asymptotically in a squared exponent. The other constructs
increase the size by only a constant per construct. By combining both
observations we can finally conclude an upper bound on the size of the formula $\lceil\alpha \preceq f\rceil$.

8 For a formal definition see [7].


Theorem 7. For any formula $f \in \mathsf{QSL}[\mathfrak{A}]$ and all probabilities $\alpha \in \mathbb{P}$, the $\mathsf{SL}[\mathfrak{A}]$
formula $\lceil\alpha \preceq f\rceil$ has at most size $3\cdot|f|\cdot 2^{(|f|_p+1)^2}$. Hence the size of the formula
$\lceil\alpha \preceq f\rceil$ is in $O(|f|)\cdot 2^{O(|f|_p^2)}$.


Proof. By induction on f. 


Remark 2 (Complexity of $\mathsf{SL}[\mathfrak{A}]$ Entailments in $\mathsf{QSL}[\mathfrak{A}]$). By Theorem 6 and
Theorem 7, the number of entailments and the size of formulae $\lceil\alpha \preceq f\rceil$ are only
exponential if $|f|_p$ is not constant. However, we would assume that an entailment
$f \models g$ in $\mathsf{QSL}[\mathfrak{A}]$, where neither in $f$ nor in $g$ the probabilistic choice $p\cdot g + (1-p)\cdot u$
appears, should have a similar runtime complexity as $\mathsf{SL}[\mathfrak{A}]$ entailment. While
it is easy to see that $\mathsf{Val}\llbracket f\rrbracket = \{0,1\}$ has constant size in this setting, the size
of the formula is still exponential. In the case where no probabilistic choice is
present, we generate multiple exponentially-sized tautologies of the form $\lceil 0 \preceq f\rceil$.
However, due to Remark 1 we can eliminate all occurrences of $\lceil 0 \preceq f\rceil$. That
means, if $f$ does not contain $p\cdot g + (1-p)\cdot u$, then for $\alpha \neq 0$, we can construct
an equivalent formula to $\lceil\alpha \preceq f\rceil$ in such a way that its size is in $O(|f|)$ and
$|\mathsf{Val}\llbracket f\rrbracket| = 2$.


6 Application: Decidable hpGCL Verification 


Since entailment in full separation logic is undecidable, it is common to consider
fragments of separation logic with a (semi-)decidable entailment problem.
Given a $\mathsf{QSL}[\mathfrak{A}]$ fragment $Q$, we provide sufficient and easy-to-check characterizations
on $\mathsf{SL}[\mathfrak{A}]$ fragments $S$ ensuring that entailment checking in $Q$ reduces to
entailment checking in $S$. This simplifies the search for decidable fragments of
quantitative separation logic.

We then apply our results in Section 6.1 to show the decidability of entailment
checking for quantitative symbolic heaps, a quantitative extension of the
well-known symbolic heap fragment of separation logic, and demonstrate the
applicability to the verification of probabilistic pointer programs.

Our reduction from entailments in $\mathsf{QSL}[\mathfrak{A}]$ to entailments in $\mathsf{SL}[\mathfrak{A}]$ relies on the
construction of the $\lceil\alpha \preceq f\rceil$ formulae from Section 4.3. This suggests the following definition:

Definition 3. Let $Q$ be a $\mathsf{QSL}[\mathfrak{A}]$ fragment. We say that an $\mathsf{SL}[\mathfrak{A}]$ fragment $S$
is $Q$-admissible if $\lceil\alpha \preceq f\rceil \in S$ holds for all $f \in Q$ and all $\alpha \in \mathbb{P}$. △




Table 7. SL[𝔄] requirements for entailment checking in QSL[𝔄].


Q fragment contains        S contains/is closed under
[φ]                        φ, true
[π] · f + [¬π] · g         π, ¬π, ∧, ∨
p · f + (1 − p) · g        ∧, ∨
f · g                      ∧, ∨
1 − f                      ¬, true
f max g                    ∨
f min g                    ∧
S x: f                     ∃
J x: f                     ∀
f ⋆ g                      ⋆, ∨
[φ] −⋆ f                   φ −⋆ ·


The syntactic nature of our construction of the SL[𝔄] formulae ⌈α ≤ f⌉ allows for a
syntactic criterion on SL[𝔄] fragments to be Q-admissible.


Lemma 1. Let Q be a QSL[𝔄] fragment. If an SL[𝔄] fragment S satisfies the
requirements provided in Table 7, then S is Q-admissible.


Proof. By induction on f. 


Finally, we provide a sufficient criterion for the decidability of entailment in 
QSL[𝔄] fragments given SL[𝔄] fragments with a decidable entailment problem.
Since entailment checks φ ⊨ ψ in SL[𝔄] can often (but not always) be reduced
to unsatisfiability checks φ ∧ ¬ψ, we take a more fine-grained perspective and
distinguish between fragments for the left- and the right-hand side of entailments,
respectively. This distinction matters when, e.g., SL[𝔄] fragments with a
decidable satisfiability problem impose restrictions on quantifiers (cf. [20]).


Theorem 8. Let Q1, Q2 be QSL[𝔄] fragments, and let S1, S2 be SL[𝔄] fragments.
If S1 is Q1-admissible and S2 is Q2-admissible, then


            φ ⊨ ψ for φ ∈ S1, ψ ∈ S2 is decidable
    implies g ⊨ f for g ∈ Q1, f ∈ Q2 is decidable.


Proof. This is a consequence of Theorem 5. 


6.1 Quantitative Symbolic Heaps 


We now demonstrate that our approach can facilitate the automated verification 
of probabilistic pointer programs by providing a sample QSL fragment with a 
decidable entailment problem. 

Recall that QSL[𝔄] is parameterized by a set 𝔄 of predicate symbols. We
obtain the quantitative symbolic heap fragment of QSL by instantiating 𝔄.

Definition 4. Let 𝔄 be the set of predicate symbols given by
    𝔄 = {true, emp} ∪ {x ↦ (y1, …, yk) | x, y1, …, yk ∈ Vars}
        ∪ {x = y, x ≠ y, x = y ∧ emp, x ≠ y ∧ emp | x, y ∈ Vars}.
Then the set QSH of quantitative symbolic heaps is given by the grammar
    f → [φ] | [π] · f + [¬π] · f | p · f + (1 − p) · f | S x: f | f ⋆ f. △


Quantitative symbolic heaps naturally extend the symbolic heap fragment of 
separation logic. Intuitively speaking, a quantitative symbolic heap f specifies 
probability (sub-)distributions over (symbolic) heaps. By applying Theorem 5, 
we obtain the following decidability result. 


Theorem 9. For loop- and allocation-free hpGCL programs C (that only perform
pointer operations, no arithmetic, and guards from the pure fragment of 𝔄)
and f1, f2 ∈ QSH, it is decidable whether the entailment wlp⟦C⟧(f1) ⊨ f2 holds.


Hence, for loop- and allocation-free programs C as above, upper bounds (in terms
of quantitative symbolic heaps f2) on the probability wlp⟦C⟧(f1) of terminating
in a given quantitative symbolic heap f1 are decidable. We refer to Section 3.3
for an example entailment involving quantitative symbolic heaps. In the sequel, 
we show how to prove the above result. 


Proof of Theorem 9. The proof relies on extended quantitative symbolic heaps 
eQSH, which include magic wands with points-to formulae on their left-hand side. 


Definition 5. The set eQSH of extended quantitative symbolic heaps is given
by the grammar


    g → [φ] | [π] · g + [¬π] · g | p · g + (1 − p) · g | g ⋆ g
      | S x: g | [x ↦ (y1, …, yk)] −⋆ g. △


Notice that indeed QSH ⊆ eQSH.


Lemma 2. For every loop- and allocation-free program C ∈ hpGCL without
arithmetic and only with guards of the pure fragment of 𝔄, extended quantitative
symbolic heaps are closed under wlp⟦C⟧, i.e.,


    for all g ∈ eQSH: wlp⟦C⟧(g) ∈ eQSH.

In particular, since QSH ⊆ eQSH, we have

    for all f ∈ QSH: wlp⟦C⟧(f) ∈ eQSH.

Proof. By induction on the structure of the loop- and allocation-free program C.


Hence, if g ⊨ f is decidable for g ∈ eQSH and f ∈ QSH, Theorem 9 follows.


Lemma 3. For g ∈ eQSH and f ∈ QSH, it is decidable whether g ⊨ f holds.


Proof. We employ Lemma 1 to determine two SL[𝔄] fragments S1, S2 such that
S1 is eQSH-admissible and S2 is QSH-admissible. Then, by Theorem 8, decidability
of g ⊨ f follows from decidability of φ ⊨ ψ for φ ∈ S1 and ψ ∈ S2. For
that, we exploit the equivalence


    φ ⊨ ψ   iff   φ ∧ ¬ψ is unsatisfiable.


The latter is decidable by [20, Theorem 3.3] since φ ∧ ¬ψ is equivalent to a
formula of the form ∃*∀*: ϑ with ϑ quantifier-free and no formula ϑ1 −⋆ ϑ2
occurring in ϑ contains a universally quantified variable.


7 Related Work 


Weakest preexpectations. Weakest precondition reasoning was established in a 
classical setting by Dijkstra [19] and has been extended to provide semantic foun- 
dations for probabilistic programs by Kozen [38,37] and McIver & Morgan [41], 
who also coined the term weakest preexpectations. Their relation to operational 
models is studied in [25]. Moreover, weakest preexpectation reasoning has been 
shown to be useful for obtaining bounds on the expected resource consumption 
[45] and, in particular, the expected run-time [33] of probabilistic programs. 


Logics for probabilistic pointer programs. Although many algorithms rely on 
randomized dynamic data structures, formal reasoning about programs that are 
both probabilistic and heap manipulating has received scarce attention. A no- 
table exception is the work by Tassarotti and Harper [51], who introduce a con- 
current separation logic with support for probabilistic reasoning, called Polaris. 
Their focus is on program refinement, employing a semantic model that is based 
on the idea of coupling, which underlies recent work on probabilistic relational 
Hoare logics [4]. However, no other decision procedures targeting entailments for 
QSL or other logics targeting probabilistic pointer programs exist. 


Leveraging SL research. As shown in Table 7, building QSL entailment checkers 
by employing our reduction technique requires the availability of SL fragments 
that support certain logical operations, and whose entailment problem is decid- 
able. Since the inception of separation logic [29], the latter has been extensively 
studied. In particular, the symbolic heap fragment of SL has received a lot of 
attention. Table 8 gives an overview of related approaches. In that table, ⋆ is
always covered. Supported (Boolean or separating) connectives are marked with
“+”, unsupported ones with “−”. “*” means that the restrictions on the connective
are more involved. “Pure” means that the connective can only appear in pure
formulae and “flat” means that the quantifier needs to be on the outermost level.


Table 8. SL fragments with decidable entailment problem.


Paper a A V—# dAv Ind. predicates Complexity 

1] pure pure pure — flat — user defined ExpTIME-hard 

11] [17] - pure - - - - Lists Polynomial 

21 + user defined 2-ExPTIME-complete 
22 - - + =- +- user defined 2-EXPTIME-complete 
at flat user defined ? 

28 — pure - -— flat- user defined ExpTIME-complete 
35 user defined 2-EXPTIME 

40 * * user defined 2-EXPTIME 

47 Eo = == = ? 

18 + Lists PSPACE-complete 

20 + * ok Ok — PSPACE-complete 


8 Discussion and Conclusion 


We studied entailment checking in QSL by means of a reduction to entailment 
checking in SL. We analyzed the complexity of our approach and demonstrated 
its applicability by means of several examples. In particular, our reduction yields 
the first decidability result for probabilistic pointer program verification. 

Our primary goal was to investigate the entailment problem for QSL to pave 
the way for automated verification of probabilistic pointer programs. Theorem 8 
provides a generic result that enables building upon the large body of work 
dealing with classical SL entailments to obtain both theoretical and practical 
insights. Theoretically, Theorem 8 gives sufficient criteria to derive QSL frag- 
ments with a decidable entailment problem from a classical SL fragment. We 
derived a QSL fragment such that reasoning about a simple probabilistic heap- 
manipulating language becomes decidable. More practically, Theorem 8 allows 
reusing existing (possibly incomplete) SL solvers to solve the entailments de- 
rived by our construction—an empirical evaluation of how well existing solvers 
can deal with these entailments is an interesting direction for future work. 

We believe that our fine-grained complexity analysis demonstrates that our 
approach can be practically feasible: the exponential blow-up in Theorem 7 stems 
from the number of probabilistic constructs in the given QSL formulae. We ex- 
pect the number of such constructs to be small for many randomized algorithms. 
We remark that existing approaches on checking quantitative entailments be- 
tween heap-independent expectations encounter similar exponential blow-ups 
(cf., [36,6]). There is thus some evidence that such exponential blow-ups do 
not prohibit one from automatically verifying non-trivial properties. We are not 
aware of work on checking quantitative entailments between expectations that 
avoids such exponential blow-ups. 

Future work includes considering richer classes of QSL and applications of 
entailment checking such as k-induction [6]. Another interesting direction is the 
applicability of our reduction to other approaches that aim for local reasoning 
about the resources employed by probabilistic programs, such as [51,3,5].


Abstract. We present a logical system CFP (Concurrent Fixed Point
Logic) that supports the extraction of nondeterministic and concurrent 
programs that are provably total and correct. CFP is an intuitionistic 
first-order logic with inductive and coinductive definitions extended by 
two propositional operators, B|_A (restriction, a strengthening of implication)
and ||(B) (total concurrency). The source of the extraction are
formal CFP proofs, the target is a lambda calculus with constructors and 
recursion extended by a constructor Amb (for McCarthy’s amb) which 
is interpreted operationally as globally angelic choice and is used to im- 
plement nondeterminism and concurrency. The correctness of extracted 
programs is proven via an intermediate domain-theoretic denotational 
semantics. We demonstrate the usefulness of our system by extracting 
a nondeterministic program that translates infinite Gray code into the 
signed digit representation. A noteworthy feature of our system is that 
the proof rules for restriction and concurrency involve variants of the 
classical law of excluded middle that would not be interpretable compu- 
tationally without Amb. 


1 Introduction 


Nondeterministic bottom-avoiding choice is an important and useful idea. With 
the wide-spread use of hardware that supports parallel computation, it has the 
possibility to speed up practical computation and, at the same time, it is related 
to computation over mathematical structures like real numbers [20,42]. On the 
other hand, it is not easy to apply theoretical tools like denotational semantics 
to nondeterministic bottom-avoiding choice [24,29] and guaranteeing correctness 
and totality of such programs through logical systems is a difficult task. 

To explain the subtlety of the problem, let us start with an example. Suppose
that M and N are partial programs that, under the conditions A and ¬A,
respectively, are guaranteed to terminate and produce values satisfying specifica- 
tion B. Then, by executing M and N in parallel and taking the result obtained 
first, we should always obtain a result satisfying B. This kind of bottom-avoiding 
nondeterministic program is known as McCarthy’s amb (ambiguous) operator 
[32], and we denote such a program by Amb(M, N). Amb is called the angelic 
choice operator and is usually studied as one of the three nondeterministic choice
operators (the other two are erratic choice and demonic choice). On the other
hand, we are interested in this operator not only from a theoretical point of view 
but also from the way it behaves as a concurrent program running on a parallel 
execution mechanism. 

If one tries to formalize this idea naively, one will face some obstacles. Let 
M r B (“M realizes B”) denote the fact that a program M satisfies a specification
B and let || (B) be the specification that can be satisfied by a concurrent program 
of the form Amb(M, N) that always terminates and produces a value satisfying 
B. Then, the above inference could be written as 


    A → (M r B)    ¬A → (N r B)
    ─────────────────────────────
         Amb(M, N) r ||(B)


However, this inference is not sound for the following reason. Suppose that A
does not hold, that is, ¬A holds. Then, the execution of N will produce a value
satisfying B. But the execution of M may terminate as well, and with data that
do not satisfy B since there is no condition on M if A does not hold. Therefore,
if M terminates first in the execution of Amb(M, N), then we obtain a result 
that may not satisfy B. 

To amend this problem, we add a new operator B|_A (pronounced “B restricted
to A”) and consider the rule


    M r (B|_A)    N r (B|_¬A)
    ──────────────────────────        (1)
        Amb(M, N) r ||(B)


Intuitively, M r (B|_A) means two things: (1) M terminates if A holds, and
(2) if M terminates, then the result satisfies B even if A does not hold.
As we will see in Sect. 5.2, the above rule is derivable in classical logic and can 
therefore be used to prove total correctness of Amb programs. 
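
The behaviour that rule (1) is meant to guarantee can be illustrated by a minimal Python sketch. It is not the Haskell implementation mentioned in this paper; the names amb, diverge and realizers are hypothetical, threads stand in for the two concurrent computations, and a thread blocked forever stands in for a non-terminating program. The specification B is assumed to be "the result equals 4".

```python
import queue
import threading


def amb(m, n):
    """Run the thunks m and n concurrently; return the first result produced."""
    results = queue.Queue()
    for thunk in (m, n):
        # Daemon threads, so a diverging argument cannot keep the process alive.
        threading.Thread(
            target=lambda t=thunk: results.put(t()), daemon=True
        ).start()
    return results.get()  # blocks until one of the two thunks has terminated


def diverge():
    """Stand-in for a non-terminating program: waits on an event never set."""
    threading.Event().wait()


def realizers(a_holds):
    """Hypothetical realizers of B|_A and B|_¬A, where B is 'result == 4'.

    m terminates exactly when A holds, n exactly when A fails, and
    whichever of them terminates returns a value satisfying B.
    """
    m = (lambda: 2 + 2) if a_holds else diverge
    n = diverge if a_holds else (lambda: 2 * 2)
    return m, n


for a_holds in (True, False):
    m, n = realizers(a_holds)
    print(a_holds, amb(m, n))
```

In both cases exactly one argument terminates, and it does so with a correct value, so the race always yields a result satisfying B. With the unrestricted premises discussed above, m would be allowed to return an arbitrary value when A fails, and the race could then pick that value.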

In this paper, we go a step further and introduce a logical system CFP 
whose formulas can be interpreted as specifications of nondeterministic programs 
although they do not talk about programs explicitly. CFP is defined by adding
the two logical operators B|_A and ||(B) to the system IFP, a logic for program
extraction [12] (see also [4,9,7]). A related approach has been developed in the 
proof system Minlog [38,6,39]. IFP supports the extraction of lazy functional 
programs from inductive/coinductive proofs in intuitionistic first-order logic. It 
has a prototype implementation in Haskell, called Prawf [8]. 

We show that from a CFP-proof of a formula, both a program and a proof 
that the program satisfies the specification can be extracted (Soundness theorem, 
Theorem 3). For example, in CFP we have the rule 


    B|_A    B|_¬A
    ──────────────  (Conc-lem)        (2)
        ||(B)


which is realized by the program λa. λb. Amb(a, b), and whose correctness is
expressed by the rule (1). Programs extracted from CFP proofs can be executed
in Haskell, implementing Amb with the concurrent Haskell package.

Compared with program verification, the extraction approach has the benefit
that (a) the proofs programs are extracted from take place in a formal system 
that is of a very high level of abstraction and therefore is simpler and easier to use 
than a logic that formalizes concurrent programs (in particular, programs do not 
have to be written manually at all); (b) not only the complete extracted program 
is proven correct but also all its sub-programs come with their specifications and 
correctness proofs since these correspond to sub-proofs. This makes it easier to 
locally modify programs without the danger of compromising overall correctness. 

As an application, we extract a nondeterministic program that converts in- 
finite Gray code to signed digit representation, where infinite Gray code is a 
coding of real numbers by partial digit streams that are allowed to contain a
⊥, that is, a digit whose computation does not terminate [18,42]. Partiality and
multi-valuedness are common phenomena in computable analysis and exact real 
number computation [46,30]. This case study connects these two aspects through 
a nondeterministic and concurrent program whose correctness is guaranteed by 
a CFP-proof. The extracted Haskell programs are available in the repository [3]. 

Organization of the paper: In Sects. 2 and 3 we present the denotational and 
operational semantics of a functional language with Amb and prove that they 
match (Thms. 1 and 2). Sects. 4 and 5 describe the formal system CFP and 
its realizability interpretation which our program extraction method is based 
on (Thms. 3 and 5). In Sect. 6 we extract a concurrent program that converts
between representations of real numbers and study its behaviour in Sect. 7. Most proofs,
unless very short, are omitted due to space limitations. Full proofs of the main
results can be found in the extended version [11]. 


2 Denotational semantics of globally angelic choice 


In [32], McCarthy defined the ambiguity operator amb as 


                ⎧ x    (x ≠ ⊥)
    amb(x, y) = ⎨ y    (y ≠ ⊥)
                ⎩ ⊥    (x = y = ⊥)


where ⊥ means ‘undefined’ and x and y are taken nondeterministically when
both x and y are not ⊥. This is called locally angelic nondeterministic choice
since convergence is chosen over divergence for each local call for the computation
of amb(x, y). It can be implemented by executing both of the arguments
in parallel and taking the result obtained first. Despite being a simple construc- 
tion, amb is known to have a lot of expressive power, and many constructions 
of nondeterministic and parallel computation such as erratic choice, countable 
choice (random assignment), and ‘parallel or’ can be encoded through it [28]. 
These multifarious aspects of the operator amb are reflected by the difficulty of 
its mathematical treatment in denotational semantics. For example, amb is not 
monotonic when interpreted over powerdomains with the Egli-Milner order [14]. 

On the other hand, one can consider an interpretation of amb as globally 
angelic choice, where an argument of amb is chosen so that the whole ambient
computation converges, if convergence is possible at all [17,40]. Since globally
angelic choice is not defined compositionally, it is not easy to integrate it into a 
design of a programming language with clear denotational semantics. However, 
it can be easily implemented by running the whole computation for both of the 
arguments of amb in parallel and taking the result obtained first. Denotationally, 
globally angelic choice can be modelled by the Hoare powerdomain construction. 
However, this would not be suitable for analyzing total correctness because the 
ordering of the Hoare powerdomain does not discriminate X and X ∪ {⊥} [23,24].
Instead, we consider a two-staged approach (see Sect. 2.2). 

The difference between the locally and the globally angelic interpretation of 
amb is highlighted by the fact that the former does not commute with function 
application. For example, if f(0) = 0 but f(1) diverges, then amb(f(0), f(1))
will always terminate with the value 0, whereas f(amb(0,1)) may return 0 
or diverge. On the other hand, the latter term will always return 0 if amb is 
implemented with a globally angelic semantics. As suggested in [17], we use this 
commutation property to realize the globally angelic semantics. 


2.1 Programs and types 


Our target language for program extraction is an untyped lambda calculus with 
recursion operator and constructors as in [12], but extended by an additional 
constructor Amb that corresponds to the globally angelic version of McCarthy’s
amb. This could be easily generalized to an Amb operator of any arity ≥ 2.


Programs ∋ M, N, L, P, Q, R ::= a, b, …, f, g   (program variables)
    | λa. M | M N | M↓N | rec M | ⊥
    | Nil | Left(M) | Right(M) | Pair(M, N) | Amb(M, N)
    | case M of {Left(a) → L; Right(b) → R}
    | case M of {Pair(a, b) → N}
    | case M of {Amb(a, b) → N}


Denotationally, Amb is just another pairing operator. Its interpretation as globally
angelic choice will come to effect only through its operational semantics.
Though essentially a call-by-name language, it also has strict application M↓N,
needed for realizing the rules for restriction and the concurrency operator.

We use a, …, g for program variables to distinguish them from the variables
x, y, z of the logical system CFP (Sect. 4). Nil, Left, Right, Pair, Amb are
called constructors. Constructors different from Amb are called data constructors.
Ca denotes the set of data constructors. Left↓M stands for (λa. Left(a))↓M,
etc., and we sometimes write Left and Right for Left(Nil) and Right(Nil).
Natural numbers are encoded as 0 ≝ Left, 1 ≝ Right(Left), and so on.
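
The encoding of natural numbers can be made concrete with a small Python sketch. The tuple representation of constructor terms and the helper names left, right, encode_nat and decode_nat are assumptions made for illustration. Reading "and so on" as "n + 1 is Right applied to the encoding of n" is also an assumption, although it agrees with the type nat given below.

```python
NIL = ("Nil",)


def left(m=NIL):
    """Left(M); with no argument this is the abbreviation Left = Left(Nil)."""
    return ("Left", m)


def right(m=NIL):
    """Right(M); with no argument this is the abbreviation Right = Right(Nil)."""
    return ("Right", m)


def encode_nat(n):
    """0 is Left(Nil); n + 1 is Right applied to the encoding of n."""
    term = left()
    for _ in range(n):
        term = right(term)
    return term


def decode_nat(term):
    """Inverse of encode_nat: count the Right constructors before Left(Nil)."""
    n = 0
    while term[0] == "Right":
        term = term[1]
        n += 1
    if term != left():
        raise ValueError("not the encoding of a natural number")
    return n


print(encode_nat(1))
print(decode_nat(encode_nat(5)))
```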


Although programs are untyped, programs extracted from proofs will be 
typable by the following system of simple recursive types: 


Types ∋ ρ, σ ::= α (type variables) | 1 | ρ × σ | ρ + σ | ρ ⇒ σ | fix α. ρ | A(ρ)

Here, A(ρ) is the type of programs which, if they terminate (see Sect. 3), reduce
to a form Amb(M, N) with M, N : ρ. The formation of fix α. ρ has the side
conditions that α occurs freely in ρ, ρ is strictly positive in α (that is, there is no
free occurrence of α in ρ which is in the left part of a function type), and not of
the form α or A(α). These conditions ensure, among other things, that the type
transformer α ↦ ρ has a unique fixed point, which is taken as the semantics of
fix α. ρ (see below). We require in A(ρ) that ρ is neither a variable nor of the
form fix α1. … fix αn. A(ρ′) (n ≥ 0). This enables the interpretation of Amb
as a bottom-avoiding choice operator (see the explanation below Corollary 1).
We call types that satisfy all these conditions regular. An example of a regular
type is the type of lazy (partial) natural numbers, nat ≝ fix α. 1 + α.


Tva:pra:p FRNil:1 Pir Lp 
TEM:p PEM:a 
I} Left(M):p+o [+ Right(M):p+o 
TEM: p IFEN:o0 TrEM:p TEN:p 
I} Pair(M,N):pxo I’ + Amb(M, N) : A(p) 
Tra:prM:a T,a:ph Ma: 
oe aa jecor nee) 
r-M:p>o0 r-N:p r- -M:p>o0 r-N:p 
TEFEMN:oa TFEFMIN: a 
TEM: plfixa. p/a] TELM :fixa.p 
r- M : fixa.p TEM: plfixa. p/a] 


TEM:pt+o TIT,a:pFL:T T,b:oF Ret 
IT - case M of {Left(a) > L; Right(b) > R}:7 


TEM:pxoa T,a:p,b:0F Nit TEM:A(p) I,a,b: pr N:T 
I} case M of {Pair(a,b) > N}:7 I’ case M of {Amb(a, b) > N}:7 


Fig. 1. Typing rules 


The typing rules are listed in Fig. 1. They are valid w.r.t. the denotational 
semantics given in Sect. 2.2 and extend the rules given in [12]. Recursive types 
are equirecursive [35] in that M : fix α. ρ if M : ρ[fix α. ρ/α].

As an example of a program consider 


    f ≝ λa. case a of {Left(_) → Left; Right(_) → ⊥}        (3)


which implements the function f discussed earlier, i.e., f 0 = 0 and f 1 = ⊥.
f has type nat ⇒ nat. Since Amb(0, 1) has type A(nat), the application
f Amb(0, 1) is not well-typed. Instead, we consider mapamb f Amb(0, 1) where
mapamb : (ρ ⇒ σ) ⇒ A(ρ) ⇒ A(σ) is defined as


    mapamb ≝ λf. λc. case c of {Amb(a, b) → Amb(f↓a, f↓b)}

This operator realizes the globally angelic semantics: mapamb f Amb(0, 1) is
reduced to Amb(f↓0, f↓1), and f↓0 and f↓1 (which are the same as f 0 and f 1
since 0 and 1 are defined) are computed concurrently and the whole expression
is reduced to 0, using the operational semantics in Section 3. In Sect. 5, we will
introduce a concurrent (or nondeterministic) version of Modus Ponens, (Conc-mp),
which will automatically generate an application of mapamb.
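
The effect of mapamb can be sketched in Python under some simplifying assumptions: an Amb term is modelled as a pair of thunks, Python integers stand in for the encoded natural numbers, and the hypothetical helper run_amb races the two components in threads. None of these names come from the extracted Haskell programs.

```python
import queue
import threading


def diverge():
    """Never returns; stands in for ⊥."""
    threading.Event().wait()


def f(n):
    """The function from (3): f 0 = 0, while f 1 diverges."""
    return 0 if n == 0 else diverge()


def mapamb(fun, amb_term):
    """Map fun over both components of Amb(a, b), where a and b are thunks.

    Each resulting thunk first forces its argument (strict application)
    and only then applies fun.
    """
    a, b = amb_term
    return (lambda: fun(a()), lambda: fun(b()))


def run_amb(amb_term):
    """Race both components in daemon threads and return the first result."""
    results = queue.Queue()
    for thunk in amb_term:
        threading.Thread(
            target=lambda t=thunk: results.put(t()), daemon=True
        ).start()
    return results.get()


amb_0_1 = (lambda: 0, lambda: 1)        # Amb(0, 1)
print(run_amb(mapamb(f, amb_0_1)))      # f 0 and f 1 race; only f 0 terminates
```

Because f is pushed inside the Amb before any choice is made, the diverging branch f 1 can never be selected, which is the globally angelic behaviour described above.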


2.2 Denotational semantics 


The denotational semantics has two phases: Phase I interprets programs in a 
Scott domain D defined by the following recursive domain equation 


    D = (Nil + Left(D) + Right(D) + Pair(D × D) + Amb(D × D) + Fun(D → D))_⊥


where + and × denote separated sum and cartesian product, and the operation
·_⊥ adds a least element ⊥ ([21] is a recommended reference for domain theory
and the solution of domain equations). A closed program M denotes an element
⟦M⟧ ∈ D as defined in Fig. 2. Note that Amb is interpreted (like Pair) as a
simple pairing operator.

A type is interpreted as a subdomain, which is a subset of D that is down- 
ward closed and closed under suprema of bounded subsets. We use the following 
operations on subdomains: 

    (X + Y)_⊥ ≝ {Left(a) | a ∈ X} ∪ {Right(b) | b ∈ Y} ∪ {⊥}

    (X × Y)_⊥ ≝ {Pair(a, b) | a ∈ X, b ∈ Y} ∪ {⊥}

    (X ⇒ Y)_⊥ ≝ {Fun(f) | f : D → D continuous, ∀a ∈ X (f(a) ∈ Y)} ∪ {⊥}.


Through the semantics in Fig. 2, closed programs denote elements of D and 
closed types denote subdomains of D such that the typing rules (Fig. 1) are 
sound. 

In Phase II we assign to every a ∈ D a set data(a) ⊆ D that reveals the role
of Amb as a choice operator. The relation ‘d ∈ data(a)’ is defined (coinductively)
as the largest relation satisfying


    d ∈ data(a) ⟺ (a = Amb(a′, b′) ∧ a′ ≠ ⊥ ∧ d ∈ data(a′)) ∨
                   (a = Amb(a′, b′) ∧ b′ ≠ ⊥ ∧ d ∈ data(b′)) ∨
                   (a = Amb(⊥, ⊥) ∧ d = ⊥) ∨
                   ⋁_{C ∈ Ca} (a = C(a⃗′) ∧ d = C(d⃗′) ∧ ⋀_i d′_i ∈ data(a′_i)) ∨
                   (a = Fun(f) ∧ d = a) ∨ (a = d = ⊥).


    ⟦a⟧η = η(a)
    ⟦λa. M⟧η = Fun(f)   where f(d) = ⟦M⟧η[a ↦ d]
    ⟦M N⟧η = f(⟦N⟧η)   if ⟦M⟧η = Fun(f)
    ⟦M↓N⟧η = f(⟦N⟧η)   if ⟦M⟧η = Fun(f) and ⟦N⟧η ≠ ⊥
    ⟦rec M⟧η = the least fixed point of f   if ⟦M⟧η = Fun(f)
    ⟦C(M1, …, Mk)⟧η = C(⟦M1⟧η, …, ⟦Mk⟧η)   (C a constructor (including Amb))
    ⟦case M of {Cl⃗}⟧η = ⟦K⟧η[a⃗ ↦ d⃗]   if ⟦M⟧η = C(d⃗) and C(a⃗) → K ∈ Cl⃗
    ⟦M⟧η = ⊥   in all other cases, in particular ⟦⊥⟧η = ⊥

η is an environment that assigns elements of D to variables.

    D^ζ_α = ζ(α),    D^ζ_1 = {Nil, ⊥},
    D^ζ_{fix α. ρ} = ⋂ {X ◁ D | D^{ζ[α ↦ X]}_ρ ⊆ X}   (X ◁ D means X is a subdomain of D)
    D^ζ_{A(ρ)} = {Amb(a, b) | a, b ∈ D^ζ_ρ} ∪ {⊥}
    D^ζ_{ρ ∘ σ} = (D^ζ_ρ ∘ D^ζ_σ)_⊥   (∘ ∈ {+, ×, ⇒})

ζ is a type environment that assigns subdomains of D to type variables.

Fig. 2. Denotational semantics of programs (Phase I) and types


Now, every closed program M denotes the set data(⟦M⟧) ⊆ D containing all
possible globally angelic choices derived from its denotation in D. For example,
data(Amb(0, 1)) = {0, 1} and, for f as defined in (3), we have, as expected,
data(mapamb f Amb(0, 1)) = data(Amb(0, ⊥)) = {0}. In Sect. 3 we will define
an operational semantics whose fair execution sequences starting with a regular-typed
program M compute exactly the elements in data(⟦M⟧).
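
For finite elements of D that contain no Fun component, the set data(a) can be computed by direct recursion, since on such well-founded elements the largest relation satisfying the defining equivalence coincides with the inductively computed one. The following Python sketch makes this concrete. The tuple encoding, the use of None for ⊥, and the function names are assumptions made for illustration.

```python
from itertools import product

BOT = None  # stands for ⊥


def amb(a, b):
    return ("Amb", a, b)


def data(a):
    """Return the set of all d with d ∈ data(a), for finite a without Fun."""
    if a is BOT:
        return {BOT}                       # clause a = d = ⊥
    tag, args = a[0], a[1:]
    if tag == "Amb":
        left, right = args
        if left is BOT and right is BOT:
            return {BOT}                   # clause a = Amb(⊥, ⊥), d = ⊥
        chosen = set()
        if left is not BOT:
            chosen |= data(left)           # choose the left argument
        if right is not BOT:
            chosen |= data(right)          # choose the right argument
        return chosen
    # Data constructor: make a choice independently in every argument.
    return {(tag, *ds) for ds in product(*(data(x) for x in args))}


ZERO = ("Left", ("Nil",))
ONE = ("Right", ZERO)

print(data(amb(ZERO, ONE)) == {ZERO, ONE})
print(data(amb(ZERO, BOT)) == {ZERO})
```

The two printed comparisons correspond to data(Amb(0, 1)) = {0, 1} and data(Amb(0, ⊥)) = {0} from the text.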


Example 1. Let M = rec λa. Amb(Left(Nil), Right(a)). M is a closed program
of type fix α. A(1 + α). We have data(M) = {0, 1, 2, …}. Thus, we can express
countable choice (random assignment) with Amb.


Lemma 1. If a ∈ D belongs to a regular type, then the following are equivalent:
(1) a ∈ {⊥, Amb(⊥, ⊥)}; (2) {⊥} = data(a); (3) ⊥ ∈ data(a).


3 Operational semantics 


We define a small-step operational semantics that, in the limit, reduces each 
closed program M nondeterministically to an element in data(⟦M⟧) (Thm. 1).
If M has a regular type, the converse holds as well: For every d ∈ data(⟦M⟧)
there exists a reduction sequence for M computing d in the limit (Thm. 2). If M
denotes a compact data, then the limit is obtained after finitely many reductions. 
In the following, all programs are assumed to be closed. 


3.1 Reduction to weak head normal form 


A program is called a weak head normal form (w.h.n.f.) if it begins with a 
constructor (including Amb), or has the form λa. M. We define inductively a
small-step leftmost-outermost reduction relation ⇝ on programs where C ranges
over constructors.


(s-i)     (λa. M) N ⇝ M[N/a]

(s-ii)    M ⇝ M′
          ─────────────
          M N ⇝ M′ N

(s-iii)   (λa. M)↓N ⇝ M[N/a]   if N is a w.h.n.f.

(s-iv)    M ⇝ M′
          ───────────────   if N is a w.h.n.f.
          M↓N ⇝ M′↓N

(s-v)     N ⇝ N′
          ───────────────
          M↓N ⇝ M↓N′

(s-vi)    rec M ⇝ M (rec M)

(s-vii)   case C(M⃗) of {…; C(b⃗) → N; …} ⇝ N[M⃗/b⃗]

(s-viii)  M ⇝ M′
          ─────────────────────────────────────
          case M of {Cl⃗} ⇝ case M′ of {Cl⃗}

(s-ix)    M ⇝ ⊥   if M is ⊥-like (see below)


⊥-like programs are such that their syntactic forms immediately imply that
they denote ⊥, more precisely they are of the form ⊥, C(M⃗) N, C(M⃗)↓N, and
case M of {…} where M is a lambda-abstraction or of the form C(M⃗) such
that there is no clause in {…} which is of the form C(a⃗) → N. W.h.n.f.s are
never ⊥-like, and the only typeable ⊥-like program is ⊥.
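
To see the rules at work, the following Python sketch implements one step of ⇝ for a fragment of the language. It is a simplification under stated assumptions: strict application (rules (s-iii) to (s-v)) is omitted, programs are closed so that naive substitution cannot capture variables, and the tuple encoding and all function names are hypothetical.

```python
BOT = ("bot",)


def con(name, *args):
    return ("con", name, args)


def subst(term, name, value):
    """Substitute the closed term value for the free variable name in term."""
    tag = term[0]
    if tag == "var":
        return value if term[1] == name else term
    if tag == "lam":
        _, bound, body = term
        return term if bound == name else ("lam", bound, subst(body, name, value))
    if tag == "app":
        return ("app", subst(term[1], name, value), subst(term[2], name, value))
    if tag == "rec":
        return ("rec", subst(term[1], name, value))
    if tag == "con":
        return ("con", term[1], tuple(subst(t, name, value) for t in term[2]))
    if tag == "case":
        clauses = {
            c: (bound, body if name in bound else subst(body, name, value))
            for c, (bound, body) in term[2].items()
        }
        return ("case", subst(term[1], name, value), clauses)
    return term  # ⊥ contains no variables


def is_whnf(term):
    return term[0] in ("lam", "con")


def step(term):
    """One step of head reduction; returns None if term is a w.h.n.f."""
    if is_whnf(term):
        return None
    tag = term[0]
    if tag == "app":
        fun, arg = term[1], term[2]
        if fun[0] == "lam":                       # (s-i)
            return subst(fun[2], fun[1], arg)
        if fun[0] == "con":                       # (s-ix): C(M) N is ⊥-like
            return BOT
        return ("app", step(fun), arg)            # (s-ii)
    if tag == "rec":                              # (s-vi)
        return ("app", term[1], term)
    if tag == "case":
        scrutinee, clauses = term[1], term[2]
        if scrutinee[0] == "con" and scrutinee[1] in clauses:   # (s-vii)
            bound, body = clauses[scrutinee[1]]
            for name, arg in zip(bound, scrutinee[2]):
                body = subst(body, name, arg)
            return body
        if is_whnf(scrutinee):                    # (s-ix): no matching clause
            return BOT
        return ("case", step(scrutinee), clauses)  # (s-viii)
    return BOT  # (s-ix) for ⊥ itself; free variables do not occur in closed programs


def head_normalize(term, fuel=1000):
    """Iterate step until a w.h.n.f. is reached or the fuel runs out."""
    for _ in range(fuel):
        nxt = step(term)
        if nxt is None:
            return term
        term = nxt
    return None  # no w.h.n.f. found within the given number of steps


ZERO = con("Left", con("Nil"))
ONE = con("Right", ZERO)
# The program f from (3).
F = ("lam", "a", ("case", ("var", "a"),
                  {"Left": (("_",), ZERO), "Right": (("_",), BOT)}))

print(head_normalize(("app", F, ZERO)) == ZERO)
print(head_normalize(("app", F, ONE), fuel=50))
```

The second call returns None because f 1 reduces to ⊥, which only reduces to itself and therefore never reaches a w.h.n.f.; the fuel parameter is merely a device to make the sketch terminate.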


Lemma 2. (1) ⇝ is deterministic (i.e., M ⇝ M′ for at most one M′).
(2) ⇝ preserves the denotational semantics (i.e., ⟦M⟧ = ⟦M′⟧ if M ⇝ M′).
(3) M is a ⇝-normal form iff M is a w.h.n.f.
(4) [Adequacy Lemma] If ⟦M⟧ ≠ ⊥, then there is a w.h.n.f. V s.t. M ⇝* V.


3.2 Making choices 


Next, we define the reduction relation ⇝ᶜ (‘c’ for ‘choice’) that reduces arguments
of Amb in parallel.


(c-i)     M ⇝ M′
          ──────────
          M ⇝ᶜ M′

(c-ii)    M1 ⇝ M1′
          ──────────────────────────────
          Amb(M1, M2) ⇝ᶜ Amb(M1′, M2)

(c-ii′)   M2 ⇝ M2′
          ──────────────────────────────
          Amb(M1, M2) ⇝ᶜ Amb(M1, M2′)

(c-iii)   Amb(M1, M2) ⇝ᶜ M1   if M1 is a w.h.n.f.

(c-iii′)  Amb(M1, M2) ⇝ᶜ M2   if M2 is a w.h.n.f.


From this definition and Lemma 2, it is immediate that M is a ⇝ᶜ-normal
form iff M is a deterministic weak head normal form (d.w.h.n.f.), that is, a
w.h.n.f. that does not begin with Amb. Finally, we define a reduction relation
⇝ᵖ that reduces arguments of data constructors in parallel.


(p-i)     M ⇝ᶜ M′
          ──────────
          M ⇝ᵖ M′

(p-ii)    Mi ⇝ᵖ Mi′   (i = 1, …, k)
          ──────────────────────────────────   (C a data constructor)
          C(M1, …, Mk) ⇝ᵖ C(M1′, …, Mk′)


Every (closed) program reduces under ⇝ᵖ (easy proof by structural induction).
For example, Nil ⇝ᵖ Nil by (p-ii). In the following, all ⇝ᵖ-reduction sequences
are assumed to be infinite.

We call a ⇝ᵖ-reduction sequence unfair if, intuitively, from some point on, one
side of an Amb term is permanently reduced but not the other. More precisely,
we inductively define M1 ⇝ᵖ M2 ⇝ᵖ … to be unfair if

– each Mi is of the form Amb(Li, R) (with fixed R) and Li ⇝ Li+1, or

– each Mi is of the form Amb(L, Ri) (with fixed L) and Ri ⇝ Ri+1, or

– each Mi is of the form C(Ni,1, …, Ni,n) (with a fixed n-ary constructor C)
and N1,k ⇝ᵖ N2,k ⇝ᵖ … is unfair for some k, or

– the tail of the sequence, M2 ⇝ᵖ M3 ⇝ᵖ …, is unfair.


A ⇝ᵖ-reduction sequence is fair if it is not unfair.

Intuitively, reduction by ⇝ᵖ proceeds as follows: A program L is head reduced
by ⇝ to a w.h.n.f. L′, and if L′ is a data constructor term, all arguments are
reduced in parallel by (p-ii). If L′ has the form Amb(M, N), two concurrent
threads are invoked for the reductions of M and N in parallel, and the one
reduced to a w.h.n.f. first is used. Fairness corresponds to the fact that the
‘speed’ of each thread is positive which means, in particular, that no thread can
block another. Note that ⇝ᶜ is not used for the reductions of M and N in (s-ii),
(s-iv), (s-v) and (s-viii). This means that ⇝ᶜ is applied only to the outermost
redex. Also, (c-ii) is defined through ⇝, not ⇝ᶜ, and thus no thread creates new
threads. This ability to bound the number of threads was not available in an earlier
version of this language [5] (see also the discussion in Sect. 8.1).
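
The role of fairness can be illustrated by a small Python sketch in which each argument of an Amb is modelled as a generator: every yield represents one ⇝ step and the returned value represents the w.h.n.f. that is eventually reached. The round-robin scheduler is only one possible fair strategy, and the names fair_amb, loop_forever and finish_after are hypothetical.

```python
def fair_amb(left, right):
    """Interleave two step-wise computations and return the first result.

    left and right are generators: each `yield` is one reduction step,
    and `return v` signals that the w.h.n.f. v has been reached.
    """
    threads = [left, right]
    while True:
        for thread in threads:          # round-robin: one step per thread
            try:
                next(thread)            # rule (c-ii) or (c-ii') for this side
            except StopIteration as done:
                return done.value       # rule (c-iii) or (c-iii')


def loop_forever():
    """A computation that never reaches a w.h.n.f."""
    while True:
        yield


def finish_after(steps, value):
    """A computation that reaches the w.h.n.f. value after `steps` steps."""
    for _ in range(steps):
        yield
    return value


print(fair_amb(loop_forever(), finish_after(3, "Nil")))
```

A scheduler that only ever stepped the left generator would run forever on this input, which corresponds to an unfair reduction sequence in the sense defined above.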


3.3 Computational adequacy: Matching denotational and 
operational semantics 


We define M_D ∈ D by structural induction on programs:

    C(M1, …, Mk)_D = C((M1)_D, …, (Mk)_D)   (C ∈ Ca)
    (λa. M)_D = ⟦λa. M⟧
    M_D = ⊥   otherwise

Since clearly M ⇝ᵖ N implies M_D ⊑_D N_D, for every computation sequence
M0 ⇝ᵖ M1 ⇝ᵖ …, the sequence ((Mi)_D)_{i∈ℕ} is increasing and therefore has a
least upper bound in D. Intuitively, M_D is the part of M that has been fully
evaluated to a data.

A computation of M is an infinite fair sequence M = M0 ⇝ᵖ M1 ⇝ᵖ …


Theorem 1 (Computational Adequacy: Soundness). For every computation
M = M0 ⇝ᵖ M1 ⇝ᵖ …, ⊔_{i∈ℕ} (Mi)_D ∈ data(⟦M⟧).


The converse does not hold in general, i.e. d ∈ data(⟦M⟧) does not necessarily
imply d = ⊔_{i∈ℕ} (Mi)_D for some computation of M. For example,
for M ≝ rec λa. Amb(a, ⊥) (for which ⟦M⟧ = ⟦Amb(M, ⊥)⟧) one sees that


d € data([M]) for every d € D while M ° M and Mp = L. But M has the 
type fix α. A(α) which is not regular (see Sect. 2.1). For programs of a regular
type, the converse of Thm. 1 holds. 


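Theorem 2 (Computational Adequacy: Completeness). If $M$ has a regular
type, then for every $d \in \mathrm{data}([\![M]\!])$, there is a computation
$M = M_0 \overset{p}{\rightsquigarrow} M_1 \overset{p}{\rightsquigarrow} \ldots$ with $d = \bigsqcup_{i \in \mathbf{N}} (M_i)_D$.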


A computation $M = M_0 \overset{p}{\rightsquigarrow} M_1 \overset{p}{\rightsquigarrow} \ldots$ is productive if some $M_i$ is a deterministic
w.h.n.f. Clearly, this is the case iff $\bigsqcup_{i \in \mathbf{N}} (M_i)_D \neq \bot$. Therefore, by the
Adequacy Theorem and Lemma 1:


Corollary 1. For a program M of regular type, the following are equivalent. 


(1) One of the computations of M is productive. 
(2) All computations of M are productive. 
(3) $[\![M]\!]$ is neither $\bot$ nor $\mathbf{Amb}(\bot, \bot)$.


The corollary does not hold without the regularity condition. For example,
$M = \mathbf{Amb}(\mathbf{Amb}(\mathbf{Nil}, \mathbf{Nil}), \mathbf{Amb}(\bot, \bot))$ can be reduced to $M_1 = \mathbf{Amb}(\bot, \bot)$ and
then repeats $M_1$ forever, whereas it can also be reduced to $\mathbf{Nil}$. McCarthy's amb
operator is bottom-avoiding in that when it can terminate, it always terminates.
Corollary 1 guarantees a similar property for our globally angelic choice operator
$\mathbf{Amb}$.


4 CFP (Concurrent Fixed Point Logic) 


In [12], the system IFP (Intuitionistic Fixed Point Logic) was introduced. IFP is
an intuitionistic first-order logic with strictly positive inductive and coinductive
definitions, from the proofs of which programs can be extracted. CFP is obtained
by adding to IFP two propositional operators, $B|_A$ and $\|(B)$, that facilitate the
extraction of nondeterministic and concurrent programs.


4.1 Syntax 


CFP is defined relative to a many-sorted first-order language. CFP-formulas
have the form $A \land B$, $A \lor B$, $A \to B$, $\forall x\,A$, $\exists x\,A$, $s = t$ ($s$, $t$ terms of the
same sort), $P(\vec{t})$ (for a predicate $P$ and terms $\vec{t}$ of fitting arities), as well as $B|_A$
(restriction) and $\|(B)$ (concurrency). Predicates are either predicate constants
(as given by the first-order language), or predicate variables (denoted $X, Y, \ldots$),
or comprehensions $\lambda\vec{x}\,A$ (where $A$ is a formula and $\vec{x}$ is a tuple of first-order
variables), or fixed points $\mu(\Phi)$ and $\nu(\Phi)$ (least fixed point aka inductive predicate
and greatest fixed point aka coinductive predicate) where $\Phi$ is a strictly positive
(s.p.) operator. Operators are of the form $\lambda X\,Q$ where $X$ is a predicate variable
and $Q$ is a predicate of the same arity as $X$. $\lambda X\,Q$ is s.p. if every free occurrence
of $X$ in $Q$ is at a strictly positive position, that is, at a position that is not in the
left part of an implication. We identify $(\lambda\vec{x}\,A)(\vec{t})$ with $A[\vec{t}/\vec{x}]$ where $[\vec{t}/\vec{x}]$ means
capture avoiding substitution.

The following syntactic properties of expressions (i.e., formulas, predicates
and operators) will be important. A Harrop expression is one that contains at
strictly positive positions neither free predicate variables nor disjunctions ($\lor$)
nor restrictions ($|$) nor concurrency ($\|$). An expression is non-Harrop if it is
not Harrop; it is non-computational (nc) if it contains neither disjunctions, nor
restrictions nor concurrency nor free predicate variables. Every nc-formula is
Harrop but not conversely. Finally, we define, recursively, when a formula is
strict: Harrop formulas and disjunctions are strict. A non-Harrop conjunction is
strict if either both conjuncts are non-Harrop or it is a conjunction of a Harrop
formula and a strict formula. A non-Harrop implication is strict if the premise is
non-Harrop. Formulas of the form $\diamond x\,A$ ($\diamond \in \{\forall, \exists\}$) or $\Box(\lambda X\,\lambda\vec{x}\,A)$ ($\Box \in \{\mu, \nu\}$)
are strict if $A$ is strict. Formulas of other forms (e.g., $B|_A$, $\|(A)$, $X(\vec{t})$) are not
strict. The significance of these definitions is that Harropness ensures that (a
proof of) the formula will have no computational content. Strictness ensures,
among other things, that $\bot$ is not a realizer (see Sect. 5).

As an additional requirement for formulas to be wellformed we demand that
in formulas of the form $B|_A$ or $\|(B)$, $B$ must be strict.
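
The Harrop and strictness conditions are purely syntactic, so they can be checked by a simple recursion over formulas. The following Haskell sketch illustrates this for a simplified formula syntax; the data type `Formula` and the function names are assumptions made for this illustration (they do not appear in the paper), and predicate variables and fixed points are omitted.

```haskell
-- Simplified CFP formulas (illustrative only): atoms are nc-formulas,
-- predicate variables and fixed points are left out.
data Formula
  = Atom String             -- an nc atomic formula, e.g. "x <= 0"
  | And Formula Formula
  | Or Formula Formula
  | Imp Formula Formula
  | Forall String Formula
  | Exists String Formula
  | Restr Formula Formula   -- Restr b a stands for B|_A
  | Conc Formula            -- Conc b stands for ||(B)

-- No disjunction, restriction or concurrency at a strictly positive position.
isHarrop :: Formula -> Bool
isHarrop (Atom _)     = True
isHarrop (And a b)    = isHarrop a && isHarrop b
isHarrop (Or _ _)     = False
isHarrop (Imp _ b)    = isHarrop b   -- the premise is not strictly positive
isHarrop (Forall _ a) = isHarrop a
isHarrop (Exists _ a) = isHarrop a
isHarrop (Restr _ _)  = False
isHarrop (Conc _)     = False

-- Strictness, following the recursive definition in the text.
isStrict :: Formula -> Bool
isStrict f | isHarrop f = True
isStrict (Or _ _)       = True
isStrict (And a b)
  | isHarrop a          = isStrict b   -- Harrop conjunct and strict conjunct
  | isHarrop b          = isStrict a
  | otherwise           = True         -- both conjuncts non-Harrop
isStrict (Imp a _)      = not (isHarrop a)
isStrict (Forall _ a)   = isStrict a
isStrict (Exists _ a)   = isStrict a
isStrict _              = False        -- restriction, concurrency

-- Example formulas: a disjunction, an implication with Harrop premise,
-- and a restriction of the disjunction.
exDisj, exImp, exRestr :: Formula
exDisj  = Or (Atom "x <= 0") (Atom "x >= 0")
exImp   = Imp (Atom "x /= 0") exDisj
exRestr = Restr exDisj (Atom "x /= 0")
```

In this sketch `exDisj` is strict, while `exImp` and `exRestr` are non-Harrop and not strict, so only `exDisj` would be allowed as the left argument of a restriction.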

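Notation: $P(\vec{t})$ will also be written $\vec{t} \in P$, and if $\Phi$ is $\lambda X\,Q$, then $\Phi(P)$ stands
for $Q[P/X]$. Definitions (on the meta level) of the form $P \stackrel{\mathrm{Def}}{=} \Box(\Phi)$ ($\Box \in \{\mu, \nu\}$)
where $\Phi = \lambda X\,\lambda\vec{x}\,A$, will usually be written $P(\vec{x}) \stackrel{\Box}{=} A[P/X]$. We write $P \subseteq Q$
for $\forall\vec{x}\,(P(\vec{x}) \to Q(\vec{x}))$, $\forall x \in P\;A$ for $\forall x\,(P(x) \to A)$, and $\exists x \in P\;A$ for
$\exists x\,(P(x) \land A)$. $\neg A \stackrel{\mathrm{Def}}{=} A \to \mathbf{False}$ where $\mathbf{False} \stackrel{\mathrm{Def}}{=} \mu(\lambda X\,X)$.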


4.2 Proof rules 


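The proof rules of CFP contain those of IFP, which are the usual natural
deduction rules for intuitionistic first-order logic with equality (see e.g. [53]), plus
the following rules for induction and coinduction, where $\Phi$ is a s.p. operator:

$$\frac{}{\Phi(\mu(\Phi)) \subseteq \mu(\Phi)}\ \mathrm{CL}(\Phi) \qquad \frac{\Phi(P) \subseteq P}{\mu(\Phi) \subseteq P}\ \mathrm{IND}(\Phi, P)$$

$$\frac{}{\nu(\Phi) \subseteq \Phi(\nu(\Phi))}\ \mathrm{COCL}(\Phi) \qquad \frac{P \subseteq \Phi(P)}{P \subseteq \nu(\Phi)}\ \mathrm{COIND}(\Phi, P)$$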


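The rules for restriction and concurrency are (with the earlier mentioned
condition that in formulas of the form $B|_A$ or $\|(B)$, $B$ must be strict):

$$\frac{A \to (B_0 \lor B_1) \qquad \neg A \to B_0 \land B_1}{(B_0 \lor B_1)|_A}\ \text{Rest-intro}\quad (A, B_0, B_1 \text{ Harrop})$$

$$\frac{B|_A \qquad B \to (B'|_A)}{B'|_A}\ \text{Rest-bind} \qquad \frac{B}{B|_A}\ \text{Rest-return}$$

$$\frac{A' \to A \qquad B|_A}{B|_{A'}}\ \text{Rest-antimon} \qquad \frac{B|_A \qquad A}{B}\ \text{Rest-mp}$$

$$\frac{}{B|_{\mathbf{False}}}\ \text{Rest-efq} \qquad \frac{B|_A}{B|_{\neg\neg A}}\ \text{Rest-stab}$$

$$\frac{B|_A \qquad B|_{\neg A}}{\|(B)}\ \text{Conc-lem} \qquad \frac{A}{\|(A)}\ \text{Conc-return}$$

$$\frac{A \to B \qquad \|(A)}{\|(B)}\ \text{Conc-mp}$$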


In Sect. 5 we will prove that each of these rules is realized by a program from 
our programming language in Sect. 2. 


4.3 Tarskian semantics, axioms and classical logic 


Although we are mainly interested in the realizability interpretation of CFP, it
is important that all proof rules of CFP are also valid w.r.t. a standard Tarskian
semantics, provided we identify $B|_A$ with $A \to B$ and $\|(B)$ with $B$.

Like IFP, CFP is parametric in a set A of axioms, which have to be closed 
nc-formulas. The significance of the restriction to nc-formulas is that these are 
identical to their (formalized) realizability interpretation (see Sect. 5), in partic- 
ular, Tarskian and realizability semantics coincide for them. Axioms should be 
chosen such that they are true in an intended Tarskian model. Since Tarskian 
semantics admits classical logic, this means that a fair amount of classical logic 
is available through axioms. For example, for each closed nc-formula $A(\vec{x})$,
stability, $\forall\vec{x}\,(\neg\neg A(\vec{x}) \to A(\vec{x}))$, can be postulated as an axiom. In addition, the rule
(Conc-lem) is a variant of the classical law of excluded middle and (Rest-stab)
permits stability for arbitrary right arguments of restriction.

In our examples and case studies we will use an instance of CFP with a 
sort for real numbers and some standard axiomatization of real closed fields 
formulated as a set of nc-formulas. In particular, we will freely use constants, 
operations and relations such as $0$, $1$, $+$, $-$, $*$, $<$, $|\cdot|$, $/$ and assume their expected
properties as axioms (expressed as nc-formulas). 


5 Program extraction


We define a realizability interpretation of CFP that will enable us to extract
concurrent programs from proofs. Since the interpretation extends the one in
IFP [12], it suffices to define realizability for the restriction and concurrency
operators and prove that their proof rules are realizable (Sect. 5.2). All definitions
and proofs of this section can be carried out in a formal system RCFP
(realizability logic for CFP) which is CFP without $|$ and $\|$ but with classical logic and
an extended first-order language that contains the earlier introduced programs
and types as terms and the typing relation ':' as a predicate constant, and
describes their semantics through suitable axioms. In particular, RCFP proves the
correctness of extracted programs (Soundness Theorem 3). Since it only matters
that RCFP is classically correct (since no realizability interpretation is applied
to it), details of RCFP do not matter and are therefore omitted.


5.1 Realizability 


Realizability for CFP is formalized in RCFP and follows the pattern in [12].
For every non-Harrop CFP-formula $A$ a type $\tau(A)$ and an RCFP-predicate $\mathbf{R}(A)$
are defined such that $\mathbf{R}(A)$ is a subset of $\tau(A)$ (more precisely, RCFP proves
$\forall a\,(\mathbf{R}(A)(a) \to a : \tau(A))$, hence the interpretation of $\mathbf{R}(A)$ is a subset of $D_{\tau(A)}$).
We often write $a\,\mathbf{r}\,A$ for $\mathbf{R}(A)(a)$ ('$a$ realizes $A$') and $\mathbf{r}\,A$ for $\exists a\,\mathbf{R}(A)(a)$ ('$A$ is
realizable').

Since Harrop formulas (see Sect. 4.1) have trivial computational content, it
only matters whether they are realizable or not. Therefore, we define for a Harrop
formula $A$ an RCFP-formula $\mathbf{H}(A)$ that represents the realizability interpretation
of $A$, but with suppressed realizer. Formally, we define by simultaneous recursion,
for every Harrop CFP-expression $E$ an RCFP-expression $\mathbf{H}(E)$, and for every
non-Harrop CFP-expression $E$ an RCFP-expression $\mathbf{R}(E)$. It is convenient to
set, in addition, for Harrop formulas $\tau(A) \stackrel{\mathrm{Def}}{=} \mathbf{1}$ and $\mathbf{R}(A) \stackrel{\mathrm{Def}}{=} \lambda a\,(a = \mathbf{Nil} \land \mathbf{H}(A))$,
so that $\tau(A)$ and $\mathbf{R}(A)$ are defined for all CFP-formulas.

The complete definition, which is shown in Fig. 3, assumes that to each
CFP predicate variable $X$ there are assigned a fresh type variable $\alpha_X$ and a
fresh RCFP predicate variable $\tilde{X}$ with one extra argument for domain elements.
Furthermore, to define realizability for the fixed points of a Harrop operator
$\lambda X\,P$, we use the notation

$$\mathbf{H}_X(P) \stackrel{\mathrm{Def}}{=} \mathbf{H}(P[\hat{X}/X])[X/\hat{X}]$$

where $\hat{X}$ is a fresh predicate constant assigned to the (non-Harrop) predicate
variable $X$. This is motivated by the fact that $\lambda X\,P$ is Harrop iff $P[\hat{X}/X]$ is.
The idea is that $\mathbf{H}_X(P)$ is the same as $\mathbf{H}(P)$ but considering $X$ as a (Harrop)
predicate constant.

To see that the definitions make sense, note that a formula $P(\vec{t})$ is Harrop iff
$P$ is, predicate variables and disjunctions are always non-Harrop, a conjunction
is Harrop iff both conjuncts are, an implication $A \to B$ is Harrop iff $B$ is, and
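For Harrop formulas $A$: $\tau(A) = \mathbf{1}$ and $\mathbf{R}(A) = \lambda a\,(a = \mathbf{Nil} \land \mathbf{H}(A))$.

$\tau(E)$ for non-Harrop expressions $E$:

$$
\begin{aligned}
\tau(P(\vec{t})) &= \tau(P) \\
\tau(A \lor B) &= \tau(A) + \tau(B) \\
\tau(A \land B) &= \begin{cases} \tau(A) \times \tau(B) & (A, B \text{ non-Harrop}) \\ \tau(A) & (B \text{ Harrop}) \\ \tau(B) & (A \text{ Harrop}) \end{cases} \\
\tau(A \to B) &= \begin{cases} \tau(A) \Rightarrow \tau(B) & (A \text{ non-Harrop}) \\ \tau(B) & (A \text{ Harrop}) \end{cases} \\
\tau(B|_A) &= \tau(B) \\
\tau(\|(B)) &= \mathbf{A}(\tau(B)) \\
\tau(\diamond x\,A) &= \tau(A) \quad (\diamond \in \{\forall, \exists\}) \\
\tau(X) &= \alpha_X \\
\tau(P) &= \mathbf{1} \quad (P \text{ a predicate constant}) \\
\tau(\lambda\vec{x}\,A) &= \tau(A) \\
\tau(\Box(\lambda X\,P)) &= \mathbf{fix}\,\alpha_X.\,\tau(P) \quad (\Box \in \{\mu, \nu\})
\end{aligned}
$$

$\mathbf{R}(E)$ for non-Harrop expressions $E$:

$$
\begin{aligned}
\mathbf{R}(P(\vec{t})) &= \lambda a\,(\mathbf{R}(P)(\vec{t}, a)) \\
\mathbf{R}(A \lor B) &= \lambda c\,(\exists a\,(c = \mathbf{Left}(a) \land a\,\mathbf{r}\,A) \lor \exists b\,(c = \mathbf{Right}(b) \land b\,\mathbf{r}\,B)) \\
\mathbf{R}(A \land B) &= \begin{cases} \lambda c\,(\exists a, b\,(c = \mathbf{Pair}(a, b) \land a\,\mathbf{r}\,A \land b\,\mathbf{r}\,B)) & (A, B \text{ non-Harrop}) \\ \lambda a\,(a\,\mathbf{r}\,A \land \mathbf{H}(B)) & (B \text{ Harrop}) \\ \lambda b\,(\mathbf{H}(A) \land b\,\mathbf{r}\,B) & (A \text{ Harrop}) \end{cases} \\
\mathbf{R}(A \to B) &= \begin{cases} \lambda c\,(c : \tau(A) \Rightarrow \tau(B) \land \forall a\,(a\,\mathbf{r}\,A \to (c\,a)\,\mathbf{r}\,B)) & (A \text{ non-Harrop}) \\ \lambda b\,(b : \tau(B) \land (\mathbf{H}(A) \to b\,\mathbf{r}\,B)) & (A \text{ Harrop}) \end{cases} \\
\mathbf{R}(B|_A) &= \lambda b\,(b : \tau(B) \land (\mathbf{r}\,A \to b \neq \bot) \land (b \neq \bot \to b\,\mathbf{r}\,B)) \\
\mathbf{R}(\|(B)) &= \lambda c\,\exists a, b\,(c = \mathbf{Amb}(a, b) \land a, b : \tau(B) \land (a \neq \bot \lor b \neq \bot) \\
&\qquad\qquad \land\, (a \neq \bot \to a\,\mathbf{r}\,B) \land (b \neq \bot \to b\,\mathbf{r}\,B)) \\
\mathbf{R}(\diamond x\,A) &= \lambda a\,(\diamond x\,(a\,\mathbf{r}\,A)) \quad (\diamond \in \{\forall, \exists\}) \\
\mathbf{R}(X) &= \tilde{X} \\
\mathbf{R}(\lambda\vec{x}\,A) &= \lambda(\vec{x}, a)\,(a\,\mathbf{r}\,A) \\
\mathbf{R}(\Box(\lambda X\,P)) &= \Box(\lambda\tilde{X}\,\mathbf{R}(P)) \quad (\Box \in \{\mu, \nu\})
\end{aligned}
$$

$\mathbf{H}(E)$ for Harrop expressions $E$:

$$
\begin{aligned}
\mathbf{H}(P(\vec{t})) &= \mathbf{H}(P)(\vec{t}) \\
\mathbf{H}(A \land B) &= \mathbf{H}(A) \land \mathbf{H}(B) \\
\mathbf{H}(A \to B) &= \begin{cases} \mathbf{r}\,A \to \mathbf{H}(B) & (A \text{ non-Harrop}) \\ \mathbf{H}(A) \to \mathbf{H}(B) & (A \text{ Harrop}) \end{cases} \\
\mathbf{H}(\diamond x\,A) &= \diamond x\,\mathbf{H}(A) \quad (\diamond \in \{\forall, \exists\}) \\
\mathbf{H}(P) &= P \quad (P \text{ a predicate constant}) \\
\mathbf{H}(\lambda\vec{x}\,A) &= \lambda\vec{x}\,\mathbf{H}(A) \\
\mathbf{H}(\Box(\lambda X\,P)) &= \Box(\lambda X\,\mathbf{H}_X(P)) \quad (\Box \in \{\mu, \nu\})
\end{aligned}
$$

Fig. 3. Realizability interpretation of CFP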
$\forall x\,A$, $\exists x\,A$, $\lambda\vec{x}\,A$ are Harrop iff $A$ is. The rationale and correctness of realizability
for restriction and concurrency are discussed in Sect. 5.2.

If a formula A is nc, then it is Harrop (see Sect. 4.1 for definitions) but in 
addition A and H(A) are syntactically identical. In contrast, in general, a Harrop 
formula A neither implies nor is implied by H(A). 


Lemma 3. For every CFP-formula A: 


(1) $\tau(A)$ is a regular type.
(2) If $A$ is strict, then $\bot$ does not realize $A$, provably in RCFP.
(3) $\mathbf{Amb}(\bot, \bot)$ is not a realizer of $A$.
(4) For a program $M$ that realizes $A$, t.f.a.e.: (i) $M$ has some productive
computation; (ii) all computations of $M$ are productive; (iii) $[\![M]\!] \neq \bot$.


Proof. (1) and (2) are easily proved by structural induction on formulas. (3) 
follows from the fact that if A is of the form Amb(B), then B must be strict. 
(4) is proved by (3) and Corollary 1 (3). 


Remarks and examples. The main difference of our interpretation to the usual 
realizability interpretation of intuitionistic number theory lies in the interpreta- 
tion of quantifiers. While in number theory variables range over natural num- 
bers, which have concrete computationally meaningful representations, we make 
no general assumption of this kind, since it is our goal to extract programs from 
proofs in abstract mathematics. This is the reason why we interpret quantifiers 
uniformly, that is, a realizer of a universal statement must be independent of the 
quantified variable and a realizer of an existential statement does not contain 
a witness. A similar uniform interpretation of quantifiers can be found in the 
Minlog system. The usual definition of realizability of quantifiers in intuitionis- 
tic number theory can be recovered by relativization to an inductively defined 
predicate N describing natural numbers in unary representation: 


$$N(x) \stackrel{\mu}{=} x = 0 \lor N(x - 1)$$

which is shorthand for $N \stackrel{\mathrm{Def}}{=} \mu(\lambda X\,\lambda x\,(x = 0 \lor X(x - 1)))$. The type $\tau(N)$
assigned to $N$ is the recursive type of unary natural numbers

$$\mathbf{nat} \stackrel{\mathrm{Def}}{=} \mathbf{fix}\,\alpha.\,\mathbf{1} + \alpha.$$

Realizability for $N$ works out as

$$a\,\mathbf{r}\,N(x) = (a = \mathbf{Left} \land x = 0) \lor \exists b\,(a = \mathbf{Right}(b) \land b\,\mathbf{r}\,N(x - 1)).$$

Thus, $N(0)$, $N(1)$, $N(2)$ are realized by $\mathbf{Left}$ (i.e., $\mathbf{Left}(\mathbf{Nil})$), $\mathbf{Right}(\mathbf{Left})$,
$\mathbf{Right}(\mathbf{Right}(\mathbf{Left}))$, and so on. Therefore, the (unique) realizer of $N(n)$ is the
unary representation of $n$. Other ways of defining natural numbers may induce
different representations. An example of a formula with interesting realizers is
the formula expressing that the sum of two natural numbers is a natural number,

$$\forall x, y\,(N(x) \to N(y) \to N(x + y)). \qquad (4)$$

It has type $\mathbf{nat} \Rightarrow \mathbf{nat} \Rightarrow \mathbf{nat}$ and is realized by a function $f$ that, given realizers
of $N(x)$ and $N(y)$, returns a realizer of $N(x + y)$; hence $f$ performs addition of
unary numbers.
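
As an illustration of such a realizer, the following Haskell sketch models unary realizers of $N(n)$ and one possible addition function. The type `D` is cut down to the three constructors needed here, and the names `toUnary`, `fromUnary` and `addU` are hypothetical helpers for this example; the program actually extracted from a proof of (4) depends on the proof and may look different.

```haskell
-- Realizers of N(n): Left(Nil) for 0 and Right(b) for a successor.
data D = Nil | Le D | Ri D deriving (Eq, Show)

-- Unary representation of a non-negative Int (demo helper only;
-- it does not terminate on negative input).
toUnary :: Int -> D
toUnary 0 = Le Nil
toUnary n = Ri (toUnary (n - 1))

-- Read a unary realizer back as an Int (demo helper only).
fromUnary :: D -> Int
fromUnary (Le _) = 0
fromUnary (Ri b) = 1 + fromUnary b
fromUnary Nil    = error "not a realizer of N"

-- A candidate realizer of formula (4): addition by recursion on the
-- realizer of N(y). It never inspects the numbers x and y themselves,
-- in line with the uniform interpretation of quantifiers.
addU :: D -> D -> D
addU a (Le _) = a
addU a (Ri b) = Ri (addU a b)
addU _ Nil    = error "not a realizer of N"
```

For instance, `addU (toUnary 2) (toUnary 3)` yields the unary representation of 5.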


Example 2 (Non-terminating realizer). Let

$$D(x) \stackrel{\mathrm{Def}}{=} x \neq 0 \to (x \le 0 \lor x \ge 0).$$

Then $\tau(D) = \mathbf{2}$ where $\mathbf{2} = \mathbf{1} + \mathbf{1}$, and $a\,\mathbf{r}\,D(x)$ unfolds to

$$a : \tau(\mathbf{2}) \land (x \neq 0 \to (a = \mathbf{Left} \land x \le 0) \lor (a = \mathbf{Right} \land x \ge 0)).$$

Therefore, $D(x)$ is realized by $\mathbf{Left}$ if $x < 0$ and by $\mathbf{Right}$ if $x > 0$. If $x = 0$, any
element of $\tau(\mathbf{2})$ realizes $D(x)$, in particular $\bot$. Hence, nonterminating programs,
which, by Lemma 3 (4), denote $\bot$, realize $D(x)$. In contrast, strict formulas are
never realized by a nonterminating program, as shown in Lemma 3 (2).


5.2 Partial correctness and concurrency 


We explain realizability for $B|_A$ and $\|(B)$ and show that the associated proof
rules are sound.

As we have seen in Example 2, an implication $A \to B$ where
$A$ is a Harrop formula is realized by a 'conditionally correct' program $M$, that
is, if $\mathbf{H}(A)$, then $M$ realizes $B$, but otherwise no condition is imposed on $M$; in
particular $M$ may be non-terminating. However, $M$ may terminate but fail to
realize $B$. This means that termination of a realizer of $A \to B$ is not a sufficient
condition for correctness (correctness meaning to realize $B$). But, as explained
in the Introduction, this is what we need to concurrently realize a formula. The
definition of realizability for the new logical operator $|$ (shown in Fig. 3) achieves
exactly this: A realizer of a restriction $B|_A$ is 'partially correct' in the sense that
it is correct iff it terminates. By Lemma 3 (4), for a program $M$ to realize $B|_A$
means that $M$ has type $\tau(B)$, and if $A$ is realizable then all the computations of
$M$ are productive, and conversely, if $M$ has a productive computation then $M$
always (that is, independently of the realizability of $A$) realizes $B$.

To highlight the difference between restriction and implication in a more
concrete situation, consider $(A \lor B)|_A$ vs. $A \to (A \lor B)$ where $A$ is Harrop.
Clearly $\mathbf{Left}$ realizes $A \to (A \lor B)$, but in general $(A \lor B)|_A$ is not realizable.
Note that $\mathbf{Left}$ even realizes $A \to^{u} (A \lor B)$ where $\to^{u}$ is Schwichtenberg's uniform
implication [39], hence restriction is also different from uniform implication.

The intuition of $\mathbf{Amb}(a, b)$ realizing $\|(A)$ is that it is a pair of candidate
realizers at least one of which is productive, and each productive one is a realizer.


Lemma 4. The rules for restriction and concurrency are realizable. 


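Proof. The table below shows the realizers of each rule for the (most interesting)
case where the conclusion is non-Harrop, using the definitions

leftright ≝ λb. case b of {Left(_) → Left; Right(_) → Right},
mapamb ≝ λf. λc. case c of {Amb(a, b) → Amb(f↓a, f↓b)}.

Proofs of their correctness are in [11]. For (Rest-intro), (Rest-stab), and
(Conc-lem), classical logic is needed. Here, we set a seq b ≝ (λc. b)↓a.

$$\frac{b\,\mathbf{r}\,(A \to (B_0 \lor B_1)) \qquad \mathbf{H}(\neg A \to B_0 \land B_1)}{(\mathrm{leftright}\ b)\,\mathbf{r}\,(B_0 \lor B_1)|_A}\ \text{Rest-intro}\quad (A, B_0, B_1 \text{ Harrop})$$

$$\frac{a\,\mathbf{r}\,B|_A \qquad f\,\mathbf{r}\,(B \to (B'|_A))}{(f{\downarrow}a)\,\mathbf{r}\,B'|_A \quad (B \text{ non-Harrop}) \qquad (a\ \mathrm{seq}\ f)\,\mathbf{r}\,B'|_A \quad (B \text{ Harrop})}\ \text{Rest-bind} \qquad \frac{a\,\mathbf{r}\,B}{a\,\mathbf{r}\,B|_A}\ \text{Rest-return}$$

$$\frac{\mathbf{r}\,(A' \to A) \qquad a\,\mathbf{r}\,B|_A}{a\,\mathbf{r}\,B|_{A'}}\ \text{Rest-antimon} \qquad \frac{b\,\mathbf{r}\,B|_A \qquad \mathbf{r}\,A}{b\,\mathbf{r}\,B}\ \text{Rest-mp}$$

$$\frac{}{\bot\,\mathbf{r}\,B|_{\mathbf{False}}}\ \text{Rest-efq} \qquad \frac{b\,\mathbf{r}\,B|_A}{b\,\mathbf{r}\,B|_{\neg\neg A}}\ \text{Rest-stab}$$

$$\frac{a\,\mathbf{r}\,B|_A \qquad b\,\mathbf{r}\,B|_{\neg A}}{\mathbf{Amb}(a, b)\,\mathbf{r}\,\|(B)}\ \text{Conc-lem} \qquad \frac{a\,\mathbf{r}\,A}{\mathbf{Amb}(a, \bot)\,\mathbf{r}\,\|(A)}\ \text{Conc-return}$$

$$\frac{f\,\mathbf{r}\,(A \to B) \qquad c\,\mathbf{r}\,\|(A)}{(\mathrm{mapamb}\ f\ c)\,\mathbf{r}\,\|(B) \quad (A \text{ non-Harrop}) \qquad \mathbf{Amb}(f, \bot)\,\mathbf{r}\,\|(B) \quad (A \text{ Harrop})}\ \text{Conc-mp}$$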
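
The auxiliary programs used in these realizers can be transcribed to Haskell fairly directly. The sketch below is an illustration under assumptions, not the authors' code: it reuses the domain type `D` from Sect. 7, models strict application $f{\downarrow}a$ with Haskell's `$!` (the name `strictApp` is hypothetical), and uses `error` in places where the extracted programs are simply undefined.

```haskell
-- Domain of realizers as in Sect. 7.
data D = Nil | Le D | Ri D | Pair (D, D) | Fun (D -> D) | Amb (D, D)

-- Forget the content of a Left/Right realizer, keeping only the tag.
leftright :: D -> D
leftright b = case b of
  Le _ -> Le Nil
  Ri _ -> Ri Nil
  _    -> error "leftright: expected Left or Right"

-- Strict application: the argument is evaluated to w.h.n.f.
-- before the function is applied.
strictApp :: (D -> D) -> D -> D
strictApp f a = f $! a

-- Apply f strictly to both candidates of an Amb, without choosing one.
mapamb :: (D -> D) -> D -> D
mapamb f c = case c of
  Amb (a, b) -> Amb (strictApp f a, strictApp f b)
  _          -> error "mapamb: expected Amb"

-- a seq b: return b, but only after a has been evaluated to w.h.n.f.
aseq :: D -> D -> D
aseq a b = strictApp (const b) a
```

Because Haskell constructors are lazy, `mapamb f (Amb (a, b))` builds the new `Amb` without evaluating either candidate; the strict applications only fire once a component is demanded, which is what a concurrent evaluator of `Amb` relies on.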


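Lemma 5. CFP derives the following rules. The rules are displayed together
with their extracted realizers.

$$(1)\quad \frac{a\,\mathbf{r}\,B_0|_{A_0} \qquad b\,\mathbf{r}\,B_1|_{A_1} \qquad \mathbf{H}(\neg\neg(A_0 \lor A_1))}{\mathbf{Amb}(\mathbf{Left}{\downarrow}a, \mathbf{Right}{\downarrow}b)\,\mathbf{r}\,\|(B_0 \lor B_1)}$$

$$(2)\quad \frac{a\,\mathbf{r}\,(B \lor C)|_D}{(\mathbf{case}\ a\ \mathbf{of}\ \{\mathbf{Left}(\_) \to \bot;\ \mathbf{Right}(b) \to b\})\,\mathbf{r}\,C|_{\neg B \land D}}\quad (C \text{ strict})$$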


Example 3. Continuing Example 2, we modify $D(x)$ to

$$D'(x) \stackrel{\mathrm{Def}}{=} (x \le 0 \lor x \ge 0)|_{x \neq 0}.$$

A realizer of $D'(x)$, which has type $\mathbf{2}$, may or may not terminate (non-termination
occurs when $x = 0$). However, in case of termination, the result is guaranteed to
realize $x \le 0 \lor x \ge 0$. Note that a realizer of $D(x)$ also has type $\mathbf{2}$ and may or
may not terminate, but there is no guarantee that it realizes $x \le 0 \lor x \ge 0$ when
it does terminate. Nevertheless, $D \subseteq D'$ follows from (Rest-intro) (since $\neg(x \neq 0)$
implies $x \le 0 \land x \ge 0$) and is realized by leftright. $D' \subseteq D$ holds trivially.


Example 4. This builds on Examples 2 and 3 and will be used in Sect. 6. Let
$t(x) = 1 - 2|x|$ and consider the predicates $E(x) \stackrel{\mathrm{Def}}{=} D(x) \land D(t(x))$ and

$$\mathrm{ConSD}(x) \stackrel{\mathrm{Def}}{=} \|((x \le 0 \lor x \ge 0) \lor |x| \le 1/2).$$

We show $E \subseteq \mathrm{ConSD}$: From $E(x)$ and Example 3 we get $D'(x)$ and $D'(t(x))$,
which unfold to $(x \le 0 \lor x \ge 0)|_{x \neq 0}$ and $(|x| \ge 1/2 \lor |x| \le 1/2)|_{|x| \neq 1/2}$.
By Lemma 5 (2), $(|x| \le 1/2)|_{|x| < 1/2}$. Since $\neg\neg((x \neq 0) \lor |x| < 1/2)$, we have
$\mathrm{ConSD}(x)$ by Lemma 5 (1). Moreover, $\tau(E) = \mathbf{2} \times \mathbf{2}$ and $\tau(\mathrm{ConSD}) = \mathbf{A}(\mathbf{3})$
where $\mathbf{3} = \mathbf{2} + \mathbf{1}$. The extracted realizer of $E \subseteq \mathrm{ConSD}$ is

conSD ≝ λc. case c of {Pair(a, b) → Amb(Left↓(leftright a),
                          Right↓(case b of {Left(_) → ⊥; Right(_) → Nil}))}

of type $\tau(E \subseteq \mathrm{ConSD}) = \mathbf{2} \times \mathbf{2} \Rightarrow \mathbf{A}(\mathbf{3})$. Explanation of this program: $a$ is Left
or Right depending on whether $x \le 0$ or $x \ge 0$ but may also be $\bot$ if $x = 0$. $b$ is
Left or Right depending on whether $|x| \ge 1/2$ or $|x| \le 1/2$ but may also be $\bot$ if
$|x| = 1/2$. Since $x = 0$ and $|x| = 1/2$ do not happen simultaneously, by evaluating
$a$ and $b$ concurrently, we obtain one of them from which we can determine one
of the cases $x \le 0$, $x \ge 0$, or $|x| \le 1/2$.
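
The behaviour of conSD on partial inputs can be observed in a small Haskell model. This is a sketch under assumptions: the type `D` is a simplified version of the domain (curried constructors, no function space) so that results can be compared, and `bottom` uses `error` as a stand-in for a non-terminating computation.

```haskell
-- Simplified domain of realizers (no function space), for illustration.
data D = Nil | Le D | Ri D | Pair D D | Amb D D deriving (Eq, Show)

-- Stand-in for the undefined value; in the paper this is non-termination.
bottom :: D
bottom = error "bottom"

leftright :: D -> D
leftright (Le _) = Le Nil
leftright (Ri _) = Ri Nil
leftright _      = bottom

-- Transcription of conSD: both candidates are built lazily, and each
-- constructor is applied strictly (the paper's down-arrow application).
conSD :: D -> D
conSD (Pair a b) =
  Amb (Le $! leftright a)
      (Ri $! (case b of { Le _ -> bottom; Ri _ -> Nil; _ -> bottom }))
conSD _ = bottom
```

With `a = Ri Nil` and an undefined `b`, the first candidate evaluates to `Le (Ri Nil)`, the realizer of the digit 1; with an undefined `a` and `b = Ri Nil`, the second candidate evaluates to `Ri Nil`, the realizer of the digit 0. In each case the other candidate is never demanded.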


5.3 Soundness and program extraction 


As we did in the above example, one can extract from any CFP-proof of a formula
a program that realizes it. This property is called the Soundness Theorem of
realizability. Its proof is the same as for IFP [12] but extended by the rules for
the new logical operators whose realizability we proved in Sect. 5.2.


Theorem 3 (Soundness Theorem I). From a CFP-proof of a formula $A$
from a set of axioms one can extract a program $M$ of type $\tau(A)$ (which is a
regular type) such that RIFP proves $M\,\mathbf{r}\,A$ from the same axioms.


In CFP, we have a second Soundness Theorem which ensures the correctness
of all results of fair computation paths of an extracted program $M$. More
precisely, correctness of $M$ means that all $d \in \mathrm{data}([\![M]\!])$ realize the formula $A^-$
obtained from $A$ by deleting all concurrency operators $\|$. Since $A^-$ is an IFP
formula, the Theorem relates the realizability interpretations of CFP and IFP.

However, such a correctness result only holds for formulas whose realizers do
not contain $\mathbf{Amb}$ in the scope of a lambda-abstraction. This restriction is
enforced by the following syntactic admissibility condition: An expression is called
admissible if it contains neither free predicate variables nor restrictions ($|$), and
all occurrences of concurrency ($\|$) are strictly positive and at non-F-position.
Here, the notion of a subexpression at F-position in a CFP expression is
defined inductively by three rules: (i) A subexpression of the form $A \to B$ where
$A$ and $B$ are both non-Harrop is at F-position. (ii) A subexpression $\Box(\lambda X\,Q)$
($\Box \in \{\mu, \nu\}$) is at F-position if $Q$ has a free occurrence of $X$ at F-position. (iii)
A subexpression within a subexpression at F-position is at F-position.

For example, $\|(\mu(\lambda X\,\lambda x\,(x = 0 \lor \forall y\,(N(y) \to X(f(x, y))))))$ is admissible,
whereas $\mu(\lambda X\,\lambda x\,(x = 0 \lor \forall y\,(N(y) \to \|(X(f(x, y))))))$ is not. The predicate
ConSD in Example 4 is admissible.


Theorem 4 (Faithfulness). If $a \in D$ realizes an admissible formula $A$, then
all $d \in \mathrm{data}(a)$ realize $A^-$.
Theorems 3 and 4 imply:


Theorem 5 (Soundness Theorem II). From a CFP proof of an admissible
formula $A$ from a set of axioms one can extract a program $M : \tau(A)$ such that
RCFP proves $\forall d \in \mathrm{data}([\![M]\!])\; d\,\mathbf{r}\,A^-$ from the same set of axioms.


Thms. 5 and 1, together with classical soundness (see Sect. 4.3), yield:

Theorem 6 (Program Extraction). From a CFP proof of an admissible
formula $A$ from a set of axioms one can extract a program $M : \tau(A)$ such that for
any computation $M = M_0 \overset{p}{\rightsquigarrow} M_1 \overset{p}{\rightsquigarrow} \ldots$, $\bigsqcup_{i \in \mathbf{N}} (M_i)_D$ realizes $A^-$ in every model
of the axioms.


6 Application 


As our main case study, we extract a concurrent conversion program between
two representations of real numbers in $[-1, 1]$, the signed digit representation and
infinite Gray code. In the following, we also write $d : p$ for $\mathbf{Pair}(d, p)$.

The signed digit representation is an extension of the usual binary expansion
that uses the set $\mathrm{SD} \stackrel{\mathrm{Def}}{=} \{-1, 0, 1\}$ of signed digits. The following predicate $S(x)$
expresses coinductively that $x$ has a signed digit representation.

$$S(x) \stackrel{\nu}{=} |x| \le 1 \land \exists d \in \mathrm{SD}\; S(2x - d),$$

with $\mathrm{SD}(d) \stackrel{\mathrm{Def}}{=} (d = -1 \lor d = 1) \lor d = 0$. The type of $S$ is $\tau(S) = \mathbf{3}^\omega$ where
$\mathbf{3} \stackrel{\mathrm{Def}}{=} (\mathbf{1} + \mathbf{1}) + \mathbf{1}$ and $\delta^\omega \stackrel{\mathrm{Def}}{=} \mathbf{fix}\,\alpha.\,\delta \times \alpha$, and its realizability interpretation is

$$p\,\mathbf{r}\,S(x) \stackrel{\nu}{=} |x| \le 1 \land \exists d \in \mathrm{SD}\; \exists p'\,(p = d : p' \land p'\,\mathbf{r}\,S(2x - d))$$

which expresses indeed that $p$ is a signed digit representation of $x$, that is,
$p = d_0 : d_1 : \ldots$ with $d_i \in \mathrm{SD}$ and $x = \sum_{i \in \mathbf{N}} d_i 2^{-(i+1)}$. Here, we identified the
three digits $d = -1, 1, 0$ with their realizers $\mathbf{Left}(\mathbf{Left})$, $\mathbf{Left}(\mathbf{Right})$, $\mathbf{Right}$.

Infinite Gray code ([18,42]) is an almost redundancy free representation of
real numbers in $[-1, 1]$ using the partial digits $\{-1, 1, \bot\}$. A stream $p = d_0 :
d_1 : \ldots$ of such digits is an infinite Gray code of $x$ iff $d_i = \mathrm{sgb}(t^i(x))$ where
$t$ is the tent function $t(x) = 1 - |2x|$ and sgb is a multi-valued version of the
sign function for which $\mathrm{sgb}(0)$ is any element of $\{-1, 1, \bot\}$ (see also Example 4).
One easily sees that $t^i(x) = 0$ for at most one $i$. Therefore, this coding has
little redundancy in that the code is uniquely determined and total except for at
most one digit which may be undefined. Hence, infinite Gray code is accessible
through concurrent computation with two threads. The coinductive predicate

$$G(x) \stackrel{\nu}{=} |x| \le 1 \land D(x) \land G(t(x)),$$

where $D$ is the predicate $D(x) \stackrel{\mathrm{Def}}{=} x \neq 0 \to (x \le 0 \lor x \ge 0)$ from Example 2,
expresses that $x$ has an infinite Gray code (identifying $-1, 1, \bot$ with
$\mathbf{Left}$, $\mathbf{Right}$, $\bot$). Indeed, $\tau(G) = \mathbf{2}^\omega$ and

$$p\,\mathbf{r}\,G(x) \stackrel{\nu}{=} |x| \le 1 \land \exists d, p'\,(p = d : p' \land (x \neq 0 \to d\,\mathbf{r}\,(x \le 0 \lor x \ge 0)) \land p'\,\mathbf{r}\,G(t(x))).$$
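
Both representations can be explored on exact rationals. The following Haskell sketch is only an illustration: the names `tent`, `sgb`, `grayDigits` and `sdValue` are chosen here, `Nothing` models the undefined digit $\bot$ (on real-number input this digit cannot be decided, which is the point of the concurrent treatment), and `sgb` fixes $\bot$ as its value at 0 although any element of $\{-1, 1, \bot\}$ would be allowed.

```haskell
-- Tent function t(x) = 1 - |2x| on exact rationals.
tent :: Rational -> Rational
tent x = 1 - abs (2 * x)

-- One admissible choice for the multi-valued sign function:
-- Nothing stands for the undefined digit.
sgb :: Rational -> Maybe Int
sgb x
  | x < 0     = Just (-1)
  | x > 0     = Just 1
  | otherwise = Nothing

-- The (lazy, infinite) Gray code of x: the digits sgb (t^i x) for i = 0, 1, ...
grayDigits :: Rational -> [Maybe Int]
grayDigits = map sgb . iterate tent

-- Value of a finite prefix of a signed digit stream: sum of d_i * 2^-(i+1).
sdValue :: [Int] -> Rational
sdValue ds = sum [fromIntegral d / 2 ^ i | (d, i) <- zip ds [1 :: Integer ..]]
```

For example, `take 5 (grayDigits 0)` gives an undefined first digit followed by `1, -1, -1, -1`, and `sdValue [1, -1, 0, 0]` evaluates to `1/4`.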
In [12], the inclusion $S \subseteq G$ was proved in IFP and a sequential conversion
function from signed digit representation to infinite Gray code extracted. On
the other hand, a program producing a signed digit representation from an
infinite Gray code cannot access its input sequentially from left to right since it
will diverge when it accesses $\bot$. Therefore, the program needs to evaluate two
consecutive digits concurrently to obtain at least one of them. With this idea in
mind, we define a concurrent version of $S$ as

$$S_2(x) \stackrel{\nu}{=} |x| \le 1 \land \|(\exists d \in \mathrm{SD}\; S_2(2x - d))$$

with $\tau(S_2) = \mathbf{fix}\,\alpha.\,\mathbf{A}(\mathbf{3} \times \alpha)$ and prove $G \subseteq S_2$ in CFP (Thm. 7). Then we
can extract from the proof a concurrent algorithm that converts infinite Gray
code to signed digit representation. Note that, while the formula $G \subseteq S_2$ is not
admissible (it contains $\|$ at an F-position), the formula $S_2(x)$ is. Therefore, if
for some real number $x$ we can prove $G(x)$, the proof of $G \subseteq S_2$ will give us
a proof of $S_2(x)$ to which Theorem 6 applies. Since $S_2(x)^-$ is $S(x)$, this means
that we have a nondeterministic program all of whose fair computation paths will
result in a (deterministic) signed digit representation of $x$.

Now we carry out the proof of $G \subseteq S_2$. For simplicity, we use pattern matching
on constructor expressions for defining functions. For example, we write
f (a : t) ≝ M for f ≝ λx. case x of {Pair(a, t) → M}.

The crucial step in the proof is accomplished by Example 4, since it yields
nondeterministic information about the first digit of the signed digit
representation of $x$, as expressed by the predicate

$$\mathrm{ConSD}(x) \stackrel{\mathrm{Def}}{=} \|((x \le 0 \lor x \ge 0) \lor |x| \le 1/2).$$


Lemma 6. $G \subseteq \mathrm{ConSD}$.
Proof. $G(x)$ implies $D(x)$ and $D(t(x))$, and hence $\mathrm{ConSD}(x)$, by Example 4.


The extracted program gscomp : $\mathbf{2}^\omega \Rightarrow \mathbf{A}(\mathbf{3})$ uses the program conSD defined in
Example 4:

gscomp (a : b : p) ≝ conSD (Pair(a, b)).

We also need the following closure properties of $G$:
Lemma 7. Assume $G(x)$. Then:
(1) $G(t(x))$, $G(|x|)$, and $G(-x)$;
(2) if $x \ge 0$, then $G(2x - 1)$ and $G(1 - x)$;
(3) if $|x| \le 1/2$, then $G(2x)$.
Proof. This follows directly from the definition of $G$ and elementary properties
of the tent function $t$. The extracted programs consist of simple manipulations
of the given digit stream realizing $G(x)$, concerning only its tail and first two
digits. No nondeterminism is involved. A detailed proof is in [11].


Theorem 7. $G \subseteq S_2$.
Proof. By coinduction. Setting $A(x) \stackrel{\mathrm{Def}}{=} \exists d \in \mathrm{SD}\; G(2x - d)$, we have to show

$$G(x) \to |x| \le 1 \land \|(A(x)). \qquad (5)$$

Assume $G(x)$. Then $\mathrm{ConSD}(x)$, by Lemma 6. Therefore, it suffices to show

$$\mathrm{ConSD}(x) \to \|(A(x)) \qquad (6)$$

which, with the help of the rule (Conc-mp), can be reduced to

$$(x \le 0 \lor x \ge 0 \lor |x| \le 1/2) \to A(x). \qquad (7)$$

(7) can be easily shown using Lemma 7: If $x \le 0$, then $t(x) = 2x + 1$. Since
$G(t(x))$, we have $G(2x - d)$ for $d = -1$. If $x \ge 0$, then $G(2x - d)$ for $d = 1$ by
(2). If $|x| \le 1/2$, then $G(2x - d)$ for $d = 0$ by (3).


The program onedigit : $\mathbf{2}^\omega \Rightarrow \mathbf{3} \Rightarrow \mathbf{3} \times \mathbf{2}^\omega$ extracted from the proof of (7)
from the assumption $G(x)$ is

onedigit (a : b : p) c ≝ case c of {Left(d) → case d of {
                                       Left(_) → Pair(−1, b : p);
                                       Right(_) → Pair(1, (not b) : p)};
                                   Right(_) → Pair(0, a : (nh p))}

not a ≝ case a of {Left(_) → Right;
                   Right(_) → Left}
nh (a : p) ≝ (not a) : p

This is lifted to a proof of (6) using mapamb (the realizer of (Conc-mp)). Hence
the extracted realizer s : $\mathbf{2}^\omega \Rightarrow \mathbf{A}(\mathbf{3} \times \mathbf{2}^\omega)$ of (5) is

s p ≝ mapamb (onedigit p) (gscomp p)


The main program extracted from the proof of Theorem 7 is obtained from
the step function s by a special form of recursion, commonly known as coiteration.
Formally, we use the realizer of the coinduction rule $\mathrm{COIND}(\Phi_{S_2}, G)$ where $\Phi_{S_2}$
is the operator used to define $S_2$ as largest fixed point, i.e.

$$\Phi_{S_2} \stackrel{\mathrm{Def}}{=} \lambda X\,\lambda x\;\, |x| \le 1 \land \|(\exists d \in \mathrm{SD}\; X(2x - d)).$$

The realizer of coinduction (whose correctness is shown in [12]) also uses a
program mon : $(\alpha_X \Rightarrow \alpha_Y) \Rightarrow \mathbf{A}(\mathbf{3} \times \alpha_X) \Rightarrow \mathbf{A}(\mathbf{3} \times \alpha_Y)$ extracted from the canonical
proof of the monotonicity of $\Phi_{S_2}$:

mon f p = mapamb (mon′ f) p
    where mon′ f (a : t) = a : f t
Putting everything together, we obtain the infinite Gray code to signed digit
representation conversion program gtos : $\mathbf{2}^\omega \Rightarrow \mathbf{fix}\,\alpha.\,\mathbf{A}(\mathbf{3} \times \alpha)$

gtos = (mon gtos) ∘ s

Using the equational theory of RIFP, one can simplify gtos to the following
program. The soundness of RIFP axioms with respect to the denotational
semantics and the adequacy property of our language guarantees that these two
programs are equivalent.

gtos (a : b : t) = Amb(
    (case a of {Left(_) → −1 : gtos (b : t);
                Right(_) → 1 : gtos ((not b) : t)}),
    (case b of {Right(_) → 0 : gtos (a : (nh t));
                Left(_) → ⊥})).
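
To see which digits this program produces, one can simulate a single fair computation path sequentially. The Haskell sketch below is a simplification under stated assumptions, not the extracted program: partial Gray digits are modelled as `Maybe Bool` with `Nothing` for $\bot$ (so undefinedness becomes observable, which it is not in the real setting), signed digits are plain `Int`s, and the simulation always follows the first candidate of Amb when the digit `a` is defined and the second candidate otherwise. The names `GrayDigit`, `notD` and `gtosSim` are introduced here.

```haskell
-- Just False = Left (digit -1), Just True = Right (digit 1), Nothing = undefined.
type GrayDigit = Maybe Bool

notD :: GrayDigit -> GrayDigit
notD = fmap not

-- Negate the head of a stream (the program nh from the text).
nh :: [GrayDigit] -> [GrayDigit]
nh (a : p) = notD a : p
nh []      = []

-- One computation path of gtos: prefer the first thread when a is defined,
-- otherwise fall back to the second thread, which needs b = Right.
gtosSim :: [GrayDigit] -> [Int]
gtosSim (a : b : t) = case a of
  Just False -> (-1) : gtosSim (b : t)
  Just True  -> 1 : gtosSim (notD b : t)
  Nothing    -> case b of
    Just True -> 0 : gtosSim (a : nh t)
    _         -> error "gtosSim: input is not an infinite Gray code"
gtosSim _ = error "gtosSim: input stream too short"
```

On the Gray code of 0 whose first digit is undefined, followed by 1 and then a constant stream of −1, this path produces the constant stream of zeros; on the Gray code of 0 that starts with 1, 1 followed by a constant stream of −1, it produces 1 followed by a constant stream of −1.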


In [43], a Gray-code to signed digit conversion program was written with 
the locally angelic Amb operator that evaluates the first two cells a and b in 
parallel and continues the computation based on the value obtained first. In that 
program, if the value of b is first obtained and it is Left, then it has to evaluate a 
again. With globally angelic choice, as the above program shows, one can simply 
neglect the value to use the value of the other thread. Globally angelic choice also 
has the possibility to speed up the computation if the two threads of Amb are 
computed in parallel and the whole computation based on the secondly-obtained 
value of Amb terminates first. 


7 Implementation 


Since our programming language can be viewed as a fragment of Haskell, we can 
execute the extracted program in Haskell by implementing the Amb operator 
with the Haskell concurrency module. We comment on the essential points of 
the implementation. The full code is available from [3]. 

First, we define the domain D as a Haskell data type: 


data D = Nil | Le D | Ri D | Pair(D, D) | Fun(D -> D) | Amb(D, D) 


The $\rightsquigarrow$-reduction, which preserves the Phase I denotational semantics and
reduces a program to a w.h.n.f. with the leftmost outermost reduction strategy,
coincides with reduction in Haskell. Thus, we can identify extracted programs
with programs of type D that compute that phase.

The $\overset{c}{\rightsquigarrow}$ reduction that concurrently calculates the arguments of Amb can
be implemented with the Haskell concurrency module. In [19], the (locally
angelic) amb operator was implemented in Glasgow Distributed Haskell (GDH).
Here, we implemented it with the Haskell libraries Control.Concurrent and
Control.Exception as a simple function ambL :: [b] -> IO b that concurrently
evaluates the elements of a list and writes the result first obtained in a
mutable variable.
Finally, the function ed :: D -> IO D produces an element of data(a) from
$a \in D$ by activating ambL for the case of Amb(a, b). It corresponds to $\overset{p}{\rightsquigarrow}$
reduction though it computes arguments of a pair sequentially. This function is
nondeterministic since the result of executing ed (Amb a b) depends on which
of the arguments a, b delivers a result first. The set of all possible results of ed a
corresponds to the set data(a).
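
A minimal sketch of how ambL and ed could be written with these libraries is given below. It is an assumption-laden illustration, not the code from [3]: exceptions raised by a candidate are treated like divergence, the first value that reaches weak head normal form wins, and this version of ed only returns on finite data (an infinite stream would need a lazier or incremental variant).

```haskell
import Control.Concurrent (forkIO, killThread)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Exception (SomeException (..), catch, evaluate)
import Control.Monad (void)

data D = Nil | Le D | Ri D | Pair (D, D) | Fun (D -> D) | Amb (D, D)

-- Evaluate all candidates concurrently to w.h.n.f. and return the first
-- one that finishes. Blocks (or fails) if no candidate ever delivers a value.
ambL :: [b] -> IO b
ambL xs = do
  result <- newEmptyMVar
  let worker x =
        (evaluate x >>= void . tryPutMVar result)
          `catch` \(SomeException _) -> return ()   -- treat errors as divergence
  tids <- mapM (forkIO . worker) xs
  v <- takeMVar result
  mapM_ killThread tids                             -- stop the remaining threads
  return v

-- Resolve every Amb in a (finite) domain element, yielding deterministic data.
ed :: D -> IO D
ed d = case d of
  Nil         -> return Nil
  Le a        -> Le <$> ed a
  Ri a        -> Ri <$> ed a
  Pair (a, b) -> do
    a' <- ed a                                      -- pair components sequentially
    b' <- ed b
    return (Pair (a', b'))
  Fun f       -> return (Fun f)
  Amb (a, b)  -> ambL [a, b] >>= ed
```

For example, `ambL [error "bottom", 42]` returns 42, since the failing candidate never writes to the shared variable. When both candidates are defined, the result depends on scheduling, which reflects the nondeterminism described above.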

We executed the program extracted in Section 6 with ed. As we have noted,
the number 0 has three Gray-codes (i.e., realizers of $G(0)$): $a = \bot : 1 : (-1)^\omega$,
$b = 1 : 1 : (-1)^\omega$, and $c = -1 : 1 : (-1)^\omega$. On the other hand, the set of signed digit
representations of 0 is $A \cup B \cup C$ where $A = \{0^\omega\}$, $B = \{0^k : 1 : (-1)^\omega \mid k \ge 0\}$,
and $C = \{0^k : (-1) : 1^\omega \mid k \ge 0\}$, i.e., $A \cup B \cup C$ is the set of realizers of $S(0)$.
One can calculate

$$\mathrm{gtos}(a) = \mathbf{Amb}(\bot, 0 : \mathbf{Amb}(\bot, 0 : \ldots))$$

and $\mathrm{data}(\mathrm{gtos}(a)) = A$. Thus $\mathrm{gtos}(a)$ is reduced uniquely to $0 : 0 : \ldots$ by the
operational semantics. On the other hand, one can calculate $\mathrm{data}(\mathrm{gtos}(b)) =
A \cup B$ and $\mathrm{data}(\mathrm{gtos}(c)) = A \cup C$. They are subsets of the set of realizers of $S(0)$
as Theorem 5 says, and $\mathrm{gtos}(b)$ is reduced to an element of $A \cup B$ as Theorem 6
says.

We wrote a program that produces a $\{-1, 1, \bot\}$-sequence in which the speed of
computation of each digit ($-1$ and $1$) can be controlled. Applying gtos to its
output and then ed to the result, we obtained the expected results.


8 Conclusion 


We introduced the logical system CFP by extending IFP [12] with two
propositional operators $B|_A$ and $\|(A)$, and developed a method for extracting
nondeterministic and concurrent programs that are provably total and satisfy their
specifications.

While IFP already imports classical logic through nc-axioms that need only
be true classically, in CFP the access to classical logic is considerably widened
through the rule (Conc-lem) which, when interpreting $B|_A$ as $A \to B$ and
identifying $\|(A)$ with $A$, is constructively invalid but has nontrivial nondeterministic
computational content.

We applied our system to extract a concurrent translation from infinite Gray 
code to the signed digit representation, thus demonstrating that this approach 
not only is about program extraction ‘in principle’ but can be used to solve 
nontrivial concurrent computation problems through program extraction. 

After an overview of related work, we conclude with some ideas for follow-up 
research. 


8.1 Related work 


The CSL 2016 paper [5] is an early attempt to capture concurrency via program
extraction and can be seen as the starting point of our work. Our main
advances over that paper are that our system is formalized as a logic for the
concurrent execution of partial programs by a globally angelic choice operator,
which is captured by a new connective $B|_A$, and that we are able to express
bounded nondeterminism with complete control of the number of threads,
whereas [5] modelled nondeterminism with countably infinite branching, which is
unsuitable or overkill for most applications. Furthermore, our approach has a
typing discipline and a sound and complete small-step reduction, and it is able
to switch between global and local nondeterminism (see Sect. 8.2 below).

As for the study of angelic nondeterminism, it is not easy to develop a de- 
notational semantics as we noted in Section 2, and it has been mainly studied 
from the operational point of view, e.g., notions of equivalence or refinement of 
processes and associated proof methods, which are all fundamental for correct- 
ness and termination [28,33,27,37,16,29]. Regarding imperative languages, Hoare 
logic and its extensions have been applied to nondeterminism and proving total- 
ity from the very beginning ([2] is a good survey on this subject). [31] studies 
angelic nondeterminism with an extension of Hoare Logic. 

There are many logical approaches to concurrency. An example is an ap- 
proach based on extensions of Reynolds’ separation logic [36] to the concurrent 
and higher-order setting [34,13,25]. Logics for session types and process calculi 
[45,15,26] form another approach that is oriented more towards the formulae-as- 
types/proofs-as-programs [22,44] or rather proofs-as-processes paradigm [1]. All 
these approaches provide highly specialized logics and expression languages that 
are able to model and reason about concurrent programs with a fine control of 
memory and access management and complex communication patterns. 


8.2 Modelling locally angelic choice 


We remarked earlier that our interpretation of Amb corresponds to globally
angelic choice. Surprisingly, locally angelic choice can be modelled by a slight
modification of the restriction and the total concurrency operators: We simply
replace A by the logically equivalent formula $A \lor \mathbf{False}$; more precisely, we set
$B|'_A \stackrel{\mathrm{Def}}{=} (B \lor \mathbf{False})|_A$ and $||'(A) \stackrel{\mathrm{Def}}{=} ||(A \lor \mathbf{False})$. Then the proof rules in
Sect. 4 with | and || replaced by |' and ||', respectively, but without the strictness
condition, are theorems of CFP. To see that the operator ||' indeed corresponds
to locally angelic choice it is best to compare the realizers of the rule (Conc-mp)
for || and ||'. Assume A, B are non-Harrop and f is a realizer of $A \to B$. Then,
if Amb(a, b) realizes ||(A), then $\mathbf{Amb}(f{\downarrow}a, f{\downarrow}b)$ realizes ||(B). This means that
to choose, say, the left argument of Amb as a result, a must terminate and so
must the ambient (global) computation $f{\downarrow}a$. On the other hand, the program
extracted from the proof of (Conc-mp) for ||' takes a realizer Amb(a, b) of ||'(A)
and returns $\mathbf{Amb}((\mathrm{up} \circ f \circ \mathrm{down}){\downarrow}a,\ (\mathrm{up} \circ f \circ \mathrm{down}){\downarrow}b)$ as realizer of ||'(B), where
up and down are the realizers of $B \to (B \lor \mathbf{False})$ and $(A \lor \mathbf{False}) \to A$, namely,
$\mathrm{up} \stackrel{\mathrm{Def}}{=} \lambda a.\, \mathbf{Left}(a)$ and $\mathrm{down} \stackrel{\mathrm{Def}}{=} \lambda c.\, \mathbf{case}\ c\ \mathbf{of}\ \{\mathbf{Left}(a) \to a\}$. Now, to choose
the left argument of Amb, it is enough for a to terminate since the non-strict
operation up will immediately produce a w.h.n.f. without invoking the ambient
computation. By redefining realizers of $B|_A$ and ||(A) as realizers of $B|'_A$ and
||'(A) and the realizers of the rules of CFP as those extracted from the proofs of
the corresponding rules for |' and ||', we have another realizability interpretation
of CFP that models locally angelic choice.


8.3 Markov’s principle with restriction 


So far, (Rest-intro) is the only rule that derives a restriction in a non-trivial way. 
However, there are other such rules, for example 


$$\frac{\forall x \in \mathbf{N}\,(P(x) \lor \neg P(x))}{\exists x \in \mathbf{N}\, P(x)\,|_{\exists x \in \mathbf{N}\, P(x)}}\ \text{Rest-Markov}$$


If P(x) is Harrop, then (Rest-Markov) is realized by minimization. More precisely,
if $f$ realizes $\forall x \in \mathbf{N}\,(P(x) \lor \neg P(x))$, then $\min(f)$ realizes the formula
$\exists x \in \mathbf{N}\, P(x)\,|_{\exists x \in \mathbf{N}\, P(x)}$, where $\min(f)$ computes the least $k \in \mathbf{N}$ such that
$f\,k = \mathbf{Left}$ if such $k$ exists, and does not terminate otherwise. One might expect
as conclusion of (Rest-Markov) the formula $\exists x \in \mathbf{N}\, P(x)\,|_{(\neg\neg\exists x \in \mathbf{N}\, P(x))}$. However,
because of (Rest-stab) (which is realized by the identity), this wouldn’t make a
difference. The rule (Rest-Markov) can be used, for example, to prove that Harrop
predicates that are recursively enumerable (re) and have re complements are
decidable. From the proof one can extract a program that concurrently searches
for evidence of membership in the predicate and its complement.
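
The computational reading of (Rest-Markov) can be sketched in a few lines. The Python below is only an illustration under stated assumptions: minimize stands for min(f), with True playing the role of Left, and decide_by_dovetailing is a hypothetical helper that interleaves the two searches sequentially instead of running them concurrently as the extracted program would.

```python
from itertools import count


def minimize(decide):
    """Return the least k with decide(k) true; loop forever if none exists.

    `decide` plays the role of a realizer f of "for all x, P(x) or not P(x)".
    """
    for k in count():
        if decide(k):
            return k


def decide_by_dovetailing(in_p, in_complement):
    """Decide membership from evidence searches for P and for its complement.

    in_p(k) / in_complement(k) tell whether k is evidence for membership in
    the predicate / in its complement. Exactly one side is assumed to have
    evidence, so the interleaved search terminates.
    """
    k = minimize(lambda k: in_p(k) or in_complement(k))
    return in_p(k)


def is_square(n):
    """Example: k*k == n is evidence for, k*k < n < (k+1)**2 evidence against."""
    return decide_by_dovetailing(
        lambda k: k * k == n,
        lambda k: k * k < n < (k + 1) ** 2,
    )
```

For instance, is_square(49) finds positive evidence at k = 7, while is_square(50) finds negative evidence at the same k. The unbounded loop in minimize is intentional: like min(f), it does not terminate when no witness exists.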


8.4 Further directions for research 


The undecidability of equality of real numbers, which is at the heart of our case 
study on infinite Gray code, is also a critical point in Gaussian elimination where 
one needs to find a non-zero entry in a non-singular matrix. As shown in [10], our 
approach makes it possible to search for such ‘pivot elements’ in a concurrent 
way. A further promising research direction is to extend the work on coinductive 
presentations of compact sets in [41] to the concurrent setting.

Abstract. We study principles and models for reasoning inductively
about properties of distributed systems, based on programmed atomic
handlers equipped with contracts. We present the Why3-do library, leveraging
a state-of-the-art software verifier for reasoning about distributed
systems based on our models. A number of examples involving invariants
containing existential and nested quantifiers (including Dijkstra’s
self-stabilizing systems) illustrate how the library promotes contract-based
modular development, abstraction barriers, and automated proofs.


1 Introduction 


The formal verification of properties of distributed algorithms and protocols is 
an important and notoriously difficult activity. The dominant approaches are: 
(i) Automatic exploration of the state space, known as model checking [LOJ4), 
a technique that can be used for both safety and liveness properties, expressed 
using variants of temporal logic. Its application to distributed systems is a
consolidated area that has produced many significant results. However, the state explosion
phenomenon means that in practice only systems of modest size can be verified. 
(ii) Deductive reasoning based on the use of inductive invariants. A number of 
tools now exist for the verification of single-threaded systems based 
on first-order logic (FOL), loop invariants, and contracts, with solid theoretical 
foundations [21[16]. Reasoning about distributed systems using inductive invariants
was, until recently, mostly a pen-and-paper activity, but tools like Verdi [42],
IronFleet [20], and Ivy [34] have made significant advances to this state of things
(see Section 7 for details). Relying on external provers (and in the case of IronFleet,
on the Dafny verifier to check the sequential code), these tools support
verification of asynchronous message-passing systems based on atomic handlers, 
reusable network/fault models, and different abstract specification mechanisms. 

Based on the same principles, we propose in this paper a conceptual contract- 
based framework for reasoning about distributed systems, as well as the Why3-do 
library for the Why3 verifier [18]. Distinctive aspects of our approach include 
the following: 


— It allows for reasoning about distributed systems using a standard program 
verification tool (rather than a dedicated tool or a proof assistant), and 
methods and techniques that are standard for sequential software. 

— Systems and protocols are described algorithmically by means of programmed 
handlers equipped with contracts that guarantee the inductiveness of invari- 
ants. Thus Why3-do brings modular development using the popular pro- 
gramming by contract methodology to the scope of distributed systems. 

— Why3-do offers other system models in addition to message-passing. We 
illustrate this in this paper by describing a locally shared memory model. 

— It takes advantage of Why3’s state of the art proof management (including 
replayability, bisection of hypotheses, and inconsistency detection); ability to 
interact with all major proof tools (automated and interactive); and internal 
transformations that allow for a combination of interactive and automated 
development, avoiding the use of proof assistants for inductive proofs. 


Contributions of the Paper. We contribute to the state of the art of dis- 
tributed system verification, and in general to software verification with Why3: 

(i) We introduce (Section 3) principles for modular verification of distributed 
systems based on clonable models, capturing in a uniform way different system 
semantics. Each model declares a set of handlers equipped with contracts. 

(ii) We present (Sections 4, 5, 6) a Why3 library with different system models and 
fault semantics. A concrete system is defined by cloning a model and defining 
its handlers and invariants. Handler implementations are required to respect the 
contracts declared in the model, which in particular ensures inductiveness of the 
invariants. Although Dafny contracts can also be used in IronFleet, the novelty 
in Why3-do is the presence of dedicated contracts in the library models, that 
are used to automatically generate verification conditions when cloning. 

(iii) We introduce (Section a model-independent specification mechanism 
based on system traces, to act as abstraction barrier between specification (ob- 
servable properties) and implementation. Traces are a common specification 
mechanism; the novelty here is the support for modular development through 
the use of model-independent clonable specification modules; different implemen- 
tations can be given for a specification, using different system models. 

(iv) We present (Section 6) a locally-shared memory model illustrating how our 
approach is applied uniformly beyond message-passing models. As far as we are 
aware Verdi, IronFleet and Ivy work with message-passing systems only. 

(v) We formalize one of Dijkstra’s self-stabilizing systems [15] and
verify its closure (safety) and convergence (liveness) properties using Why3-do.
This verification is of independent interest: our proof of convergence, using a
measure function, takes advantage of SMT solvers and significantly improves on
previous, much more laborious efforts using proof assistants (Section 6).

(vi) We propose two techniques for reasoning with inductive invariants containing
existential and nested quantifiers: stepwise bounded validation (Section 6),
and the use of dual definitions containing both code and logic (Sections 4 and
6). Together with Why3’s ability to interact with multiple solvers with different
strengths, dual definitions allow for more robust and natural specifications, as
well as for easier automated proofs, without the need for tricks like quantifier
hiding [20]. Both techniques are explained by means of examples.

module MapList
  use int.Int, list.List, list.Mem, list.Length, list.NthNoOpt

  val function f (x:int) : int requires {x >= 0} ensures {result >= 0}
  predicate nonNeg (l:list int) = forall x :int. mem x l -> x >= 0

  let rec map_list (l:list int) : list int
    requires { nonNeg l }
    ensures { nonNeg result /\ forall j. 0<=j<length l -> nth j result = f(nth j l) }
    variant { l }
  = match l with
    | Nil -> Nil
    | Cons h t -> Cons (f h) (map_list t)
    end
end (* module MapList *)


module MapFib
  use int.Int, list.List, list.Mem, list.Length, list.NthNoOpt, ref.Ref

  inductive fibpred int int =
  | zero : fibpred 0 0
  | one : fibpred 1 1
  | oth : forall n r1 r2 :int. n>=2 -> fibpred (n-1) r1 /\ fibpred (n-2) r2 -> fibpred n (r1+r2)

  let function calcfib (m:int) : int
    requires { m >= 0 }
    ensures { result >= 0 /\ forall r. fibpred m r <-> r=result }
  = let n = ref 0 in let x = ref 0 in let y = ref 1 in
    while !n < m do
      invariant { 0 <= !n <= m /\ !x >= 0 /\ !y >= 0 }
      invariant { forall r. (fibpred !n r <-> r = !x) /\ (fibpred (!n+1) r <-> r = !y) }
      variant { m - !n }
      let tmp = !x in x := !y; y := !y+tmp; n := !n+1;
    done;
    !x

  clone MapList with val f = calcfib
  lemma mapFib_lm: forall l:list int. nonNeg l -> let fibl = map_list l in
    nonNeg fibl /\ forall j. 0<=j<length l -> nth j fibl = calcfib (nth j l)
end (* module MapFib *)


Listing 2.1. Why3 example 


All the models and example modules mentioned in the paper are available 
for experimentation in the Why3-do artifact [28]. 


2 The Why3 Languages in a Nutshell 


The example in Listing 2.1 illustrates the use of Why3’s logic and programming
languages, as well as the module cloning mechanism. The MapList module
first imports a number of theories for mathematical integers and lists from the 
standard library. Why3 includes a wide range of theories, usable across provers. 
A program function f is then declared with the val keyword, including a sim- 
ple contract: a precondition requiring its argument to be nonnegative, and a 
postcondition stating that the result is also nonnegative. In the rest of the mod- 
ule this contract will be assumed to hold for f. Next, a logic predicate nonNeg 


Why3-do: The Way of Harmonious Distributed System Proofs 117 


is defined. It uses a universal quantifier to state that every element of its ar- 
gument list is nonnegative. Finally, the map_list program function is defined. 
The definition includes both the function’s recursive definition and a contract, 
in particular a postcondition that uses a universal quantifier to state the map- 
ping property (result refers to the return value). From this module, Why3 will 
generate verification conditions (VCs) ensuring that the definition is consistent 
with its contract, assuming the definition of f keeps to its own contract. This 
interplay between contracts plays a fundamental role in deductive verification. 

This little example allows us to elaborate on another aspect of Why3. nonNeg 
is also a function (returning a truth value), but it lives in a different namespace 
from map_list, which is a WhyML program function. nonNeg belongs to Why3’s 
logic language [17], and its definition contains a quantifier, which cannot be 
used in programs. However, pure program functions, which do not modify the 
global state, may also be used in the logic, if their declaration includes the 
function keyword. This is the case of f, used in both the code and the contract 
of map_list. We will refer to program functions that can be used in the logic as 
“let functions”. map_list is also pure, but is not declared as a let function. 

Why3 encodes both the code and contracts of let functions, so one may choose 
to write certain logic functions algorithmically or logically, or both. For instance 
nonNeg could be defined alternatively as follows (the postcondition is optional): 

let rec predicate nonNeg (l:list int)
  ensures { result <-> forall x :int. mem x l -> x >= 0 }
= match l with
  | Nil -> true | Cons h t -> h>=0 && nonNeg t end


If the postcondition is present, the logic encoding of the predicate will contain 
redundancy (no inconsistency can be created since the definition must respect the 
contract). Writing such “dual definitions” of logic functions may be a good idea 
for a number of reasons, namely the possibility of including preconditions, and 
termination checks based on user-provided variants. Moreover, dual definitions 
increase the robustness of specifications and may facilitate automated proofs 
of results involving quantifiers. Not every logic function can be defined as a 
let function: since the latter must remain executable, they may not contain 
for instance occurrences of logic equality or quantifiers. In these cases let ghost 
functions can be used. These are pure logic definitions that are not meant to be 
executed, but are still written as programs. 

A second module, MapFib, defines a program function calcfib that com- 
putes Fibonacci numbers using a loop. The recursive definition of the Fibonacci 
sequence (used in the function and loop invariant of calcfib) cannot be written 
as a logic function, since it is not total. It could be defined as a let function with 
a precondition restricting its domain, but we use instead an inductive predicate 
fibpred: the formula fibpred n f means that f is the nth Fibonacci number. 
Inductive predicates, familiar to readers acquainted with proof assistants, are de- 
fined by means of a set of inference rules. They are used in our models to define 
non-deterministic transition relations on distributed system configurations. 

Why3 will generate and successfully discharge VCs ensuring the correctness 
of calcfib with respect to its contract. Now, since calcfib is in accordance with
the contract of f in MapList, this module can be cloned instantiating the latter
function with the former. This imports into the current module a copy of every
element of MapList, with calcfib substituted for f, and generates refinement
VCs, to ensure that calcfib’s contract is stronger than f’s. Finally, the lemma
mapFib_lm states that indeed map_list maps the function calcfib as expected.
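
To make the role of the loop invariants concrete, here is a rough Python analogue of calcfib in which the contract and invariants are checked at run time with assertions. It is only a sketch: Why3 proves these conditions statically for all inputs, whereas the assertions below test them on the inputs actually supplied, and fib_spec is a hypothetical executable stand-in for the inductive predicate fibpred.

```python
def fib_spec(n):
    """Reference definition mirroring the three rules of fibpred."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fib_spec(n - 1) + fib_spec(n - 2)


def calcfib(m):
    """Loop-based Fibonacci; contract and invariants are checked by assert."""
    assert m >= 0                                   # requires { m >= 0 }
    n, x, y = 0, 0, 1
    while n < m:
        # invariants: 0 <= n <= m, x is the n-th and y the (n+1)-th number
        assert 0 <= n <= m and x >= 0 and y >= 0
        assert x == fib_spec(n) and y == fib_spec(n + 1)
        x, y = y, y + x                             # same update as tmp/x/y in the loop body
        n += 1
    assert x >= 0 and x == fib_spec(m)              # ensures
    return x


def map_list(f, xs):
    """Apply f to every element, as map_list does after cloning with f = calcfib."""
    return [f(v) for v in xs]
```

Calling map_list(calcfib, [0, 1, 2, 3, 4, 5]) corresponds to the instance described by the lemma: each output element is calcfib of the input element at the same position.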


3 Distributed Systems and Models 


A distributed system consists of a set $N$ of nodes, each of which can at any
moment be in a state taken from a set $\Sigma$, together with additional elements,
such as a communication network or a shared memory. We will call the global
state of such a system a world and denote by $W$ the set of all worlds. In general,
worlds will include the local state of every node in the system, captured as
a mapping $lS : N \to \Sigma$. Different models will specialize this basic setting to
define different notions of distributed system (and consequently also of world),
including for instance different communication and fault models (we will always
write $N$, $\Sigma$, or $W$ in the context of a specific system model, left implicit).

Models are handler-based: systems are described by writing code executed 
by nodes in response to certain events, such as receiving a message from the 
network or an input from the local environment, or simply being enabled by a 
guard predicate that becomes true. Handlers are assumed to execute atomically. 
Each model defines a transition semantics describing how worlds evolve step by
step, allowing for all possible schedules (both locally and globally). Each model
contains a set of rules inferring judgments of the form $w \leadsto w'$, meaning that
the system’s global state $w$ evolves to $w'$. The general form of the rules states
the following: if the world $w'$ results from $w$ when a handler is executed by one
of the system’s nodes, then $w \leadsto w'$.

Let $w_0$ correspond to the initial state of the system, and $\leadsto^*$ denote the
reflexive-transitive closure of $\leadsto$. A world $w$ is said to be reachable if $w_0 \leadsto^* w$.
Let $\Phi$ be some property of worlds; we will write $w \models \Phi$ to signify that $\Phi$ is
satisfied by the world $w$. A system is said to be correct with respect to $\Phi$ if
$w \models \Phi$ holds for every reachable world $w$. A typical correctness proof involves
finding an inductive invariant: a property $I$ such that (i) $w_0 \models I$, and (ii) for
every pair $w$, $w'$ of worlds, if $w \models I$ and $w \leadsto w'$, then $w' \models I$. If $w \models I$ implies
$w \models \Phi$, this is sufficient to guarantee correctness.
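
The difference between a property that merely holds in all reachable worlds and an inductive invariant can be seen on a toy system. The Python sketch below is purely illustrative and is not part of Why3-do: the two-node system, the handler (add 2 modulo 8 to the node's own state), and all names are assumptions chosen so that the whole state space can be enumerated.

```python
from itertools import product

NODES = (0, 1)
W0 = (0, 0)                                   # initial world: both nodes in state 0
ALL_WORLDS = list(product(range(8), repeat=len(NODES)))


def successors(w):
    """Worlds w' with w ~> w': some node n runs its handler (add 2 mod 8)."""
    return [w[:n] + ((w[n] + 2) % 8,) + w[n + 1:] for n in NODES]


def reachable(w0, succ):
    """All worlds w with w0 ~>* w, found by breadth-first exploration."""
    seen, frontier = {w0}, [w0]
    while frontier:
        nxt = []
        for w in frontier:
            for w2 in succ(w):
                if w2 not in seen:
                    seen.add(w2)
                    nxt.append(w2)
        frontier = nxt
    return seen


def is_inductive(inv, w0, worlds, succ):
    """(i) inv holds in w0; (ii) inv is preserved by every step from `worlds`."""
    if not inv(w0):
        return False
    return all(inv(w2) for w in worlds if inv(w) for w2 in succ(w))


def phi(w):
    """Property of interest: no node is ever in state 1."""
    return all(s != 1 for s in w)


def inv(w):
    """Candidate inductive invariant: every node state is even."""
    return all(s % 2 == 0 for s in w)
```

Here phi holds in every reachable world but is not inductive, because the unreachable world (7, 0) satisfies phi and steps to (1, 0). The stronger predicate inv is inductive and implies phi, which is exactly the pattern of a typical correctness proof described above.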


Contract-based Models. We introduce the use of handler contracts for designing
and verifying distributed systems. Let us consider a model with worlds of the
form $(lS, \ldots)$, with $\ldots$ standing for other components of worlds in addition to
the state function. The signature and contract of a handling function will be of
the following general form, where $I$ is a candidate invariant predicate, and other
arguments and return values ($\ldots$) may be present:

$$\begin{array}{l}
\mathit{handle}(n : N,\ lS : N \to \Sigma,\ \ldots) : (\sigma : \Sigma,\ \ldots)\\
\quad \mathbf{requires}\ I(lS, \ldots)\\
\quad \mathbf{ensures}\ I(lS[n \mapsto \sigma], \ldots)
\end{array}$$

The function returns the new state $\sigma$ of the node $n$ that executes the handler in
a world with state function $lS$. This general form will be adapted with modifications
in different models. For instance, handling functions may have access only
to the local state and not to the entire state function $lS$, or they may return,
in addition to a new state, a list of messages to be sent by $n$. Transition rules
have the following general form, updating the state of the node that executes
the handler, and reflecting in the world other effects of the execution.


$$\frac{\mathit{handle}(n, lS, \ldots) = (\sigma, \ldots)}{(lS, \ldots) \leadsto (lS[n \mapsto \sigma], \ldots)}$$


The handler’s contract, consisting of precondition $I(lS, \ldots)$ and postcondition
$I(lS[n \mapsto \sigma], \ldots)$, ensures that if the handler is executed in a world satisfying the
invariant $I$, then the world resulting from this transition still satisfies $I$.

It is common for handlers to have access only to the state $\sigma$ of the node $n$
where they are being executed. In this case it is not possible to include $I(lS, \ldots)$
as a precondition in the contract, since $lS$ is not passed as a parameter. Preservation
of the invariant can be written instead as a conditional postcondition,
stating that for every world satisfying $I$ in which $\sigma$ is the state of node $n$ and
this node executes the handler, the resulting world still satisfies $I$:

$$\begin{array}{l}
\mathit{handle}(n : N,\ \sigma : \Sigma,\ \ldots) : (\sigma' : \Sigma,\ \ldots)\\
\quad \mathbf{ensures}\ \forall\, lS : N \to \Sigma.\ \sigma = lS\,n \to I(lS, \ldots) \to I(lS[n \mapsto \sigma'], \ldots)
\end{array}$$


The Why3-do Library. Listing 3.1 illustrates how contract-based models are 
written as Why3 modules. The World module declares basic types and func- 
tions, and defines the world structured type. The Steps module includes val 
declarations for (i) the initial world, (ii) an inductive invariant predicate, and 
(iii) a set of handling functions (illustrated here by handle_1). Contracts en- 
force that the inductive invariant is satisfied by the initial world, and preserved 
by handlers. Each handler’s contract makes use of a step_1 auxiliary function, 
that is also used in the definition of the transition semantics through the step 
inductive predicate. The module ends with the definition of reachable world, and 
a lemma stating that the invariant holds in all reachable worlds (this is proved 
inductively for each model, using proof transformations and SMT solvers). 
That is all that is required to define a system model, which may now be cloned 
to produce concrete distributed systems. Listing 3.2 illustrates how simple this 
is. We write a System module that defines, first of all, types for nodes, states, 
messages, and other relevant elements, and if appropriate, well-formedness pred- 
icates for different entities. The World module from the desired Why3-do library 
model can then be cloned, after which the following are defined: (i) the initial 
world, (ii) a candidate inductive invariant predicate, and (iii) handler functions 
specifying the behavior of the system’s nodes/processes. The Steps module from 
the same model is now cloned, instantiating these elements. Why3 will produce 
a set of VCs, generated from the contracts contained in the cloned module, en- 
suring that the invariant is inductive. Properties of interest can at last be stated 
and proved (which may involve writing additional definitions and lemmas).

module World (* file model.mlw *)
  type node
  type state
  type world = (map node state, ...)

  function localState (w:world) : map node state = (* projection functions for worlds *)
    let (lS, ...) = w in lS
end (* module World *)

module Steps (* file model.mlw *)
  val function initState (node) : state (* init functions for world components *)
  constant initWorld : world = (initState, ...)

  val ghost predicate indpred (w:world)
    ensures { w=initWorld -> result } (* initial world must satisfy invariant *)

  (* specifying the new world that results from w when n executes a handler yielding results r *)
  function step_1 (w:world) (n:node) (r:(state, ...)) : world =
    let (st, ...) = r in
    let newLocalState = set (localState w) n st in
    (newLocalState, ...)

  (* handlers’ arguments include a node h and its state; results include a new state for h *)
  val function handle_1 (h:node) (sig:state) ... : (state, ...)
    ensures { forall w :world. indpred w -> sig = localState w h -> ... ->
              indpred (step_1 w h result) }

  inductive step world world =
  | step_1 : forall w :world, n :node.
      step w (step_1 w n (handle_1 n (localState w n) ...))
  | ...

  inductive step_TR world world =
  | base : forall w :world. step_TR w w
  | step : forall w w' w'' :world. step_TR w w' -> step w' w'' -> step_TR w w''

  predicate reachable (w:world) = step_TR initWorld w

  (* inductive invariant holds in all reachable worlds *)
  lemma indpred_reachable : forall w :world. reachable w -> indpred w
end (* module Steps *)


Listing 3.1. Basic structure of a Why3-do model 


4 The Basic Message-Passing Model 


In this model nodes communicate by exchanging packets: triples of the form
$(d, s, m)$, carrying a message $m \in \mathit{Msg}$ from node $s \in N$ to node $d \in N$, with
$\mathit{Msg}$ a given set of messages. Worlds are pairs $(lS, nt)$ where $lS : N \to \Sigma$ is a
function assigning a state to each node and $nt : \mathit{Msg}^*$ is a network, abstracted as
a list of packets. In a system based on this asynchronous model, nodes execute
a message handler whenever they receive a message, and may in turn send
messages to other nodes. The handleM function implements this local
message-handling behavior. Its parameters include the node h handling the message, the
node that sent the message, the state of the handling node, and the message
itself. It returns a new state for h and a list of packets to be sent to the network.

module System (* file system.mlw *)
  type node = int
  type state = int
  clone model.World with type node, type state

  let function initState (m:node) : state = ...
  let ghost predicate indpred (w:world) = ...
  let function handle_1 (h:node) (lS:map node state) : state = ...

  clone model.Steps with type node, type state, val initState, val indpred, val handle_1

  goal systemProperty : forall w :world. reachable w -> ...
end (* module System *)


Listing 3.2. Basic structure of a Why3-do system module 


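Its signature and contract are (with $I$ a candidate invariant):

$$\begin{array}{l}
\mathit{handleM}(h : N,\ s : N,\ m : \mathit{Msg},\ \sigma : \Sigma) : (\sigma' : \Sigma,\ nt' : \mathit{Msg}^*)\\
\quad \mathbf{ensures}\ \forall\, lS : N \to \Sigma,\ nt : \mathit{Msg}^*.\ \sigma = lS\,h \to (h, s, m) \in nt\\
\qquad \to I(lS, nt) \to I(lS[h \mapsto \sigma'],\ nt' + nt - \{(h, s, m)\})
\end{array}$$

The semantics of the model are given by the following transition rule:

$$\frac{\mathit{handleM}(h, s, m, lS(h)) = (\sigma, nt') \qquad (h, s, m) \in nt}{(lS, nt) \leadsto (lS[h \mapsto \sigma],\ nt' + nt - \{(h, s, m)\})}\ (\textit{message})$$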


We use notation $+$, $-$, and $\in$ for list concatenation, difference, and membership.
Any packet that is in transit in the network may be selected by the rule to be 
delivered and handled by the receiving node. The rule removes the packet from 
the network, updates the state of the handling node, and sends new packets as 
prescribed by the handler. The semantics takes into account all possible orders 
of message delivery, since any message may be extracted from the packet pool. 
The semantics is otherwise idealized, but the library contains additional models 
in which messages may be dropped or duplicated by the network (an example 
verification of a system assuming message duplication is given in Section 5). 
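
The (message) rule can be read as a small state-update function. The following Python sketch is an informal rendering under stated assumptions, not the Why3-do model: worlds are pairs of a dictionary and a list, list difference removes a single occurrence of the delivered packet, and ping_handler is a hypothetical example handler introduced only to exercise the rule.

```python
def step_message(world, packet, handle_m):
    """Apply the (message) rule to one chosen in-flight packet."""
    local_state, network = world
    dest, src, msg = packet
    assert packet in network                      # premise: (h, s, m) is in nt
    new_state, sent = handle_m(dest, src, msg, local_state[dest])
    remaining = list(network)
    remaining.remove(packet)                      # nt - {(h, s, m)}, one occurrence
    updated = dict(local_state)
    updated[dest] = new_state                     # lS[h |-> sigma]
    return updated, sent + remaining              # nt' + nt - {(h, s, m)}


def successors(world, handle_m):
    """One-step successors: any packet in transit may be delivered next."""
    return [step_message(world, p, handle_m) for p in set(world[1])]


def run(world, handle_m, choose=lambda packets: packets[0]):
    """Deliver packets, picked by `choose`, until the network is empty."""
    while world[1]:
        world = step_message(world, choose(world[1]), handle_m)
    return world


def ping_handler(h, s, m, sigma):
    """Hypothetical handler: add m to the local state; if m > 0, reply with m - 1."""
    out = [(s, h, m - 1)] if m > 0 else []
    return sigma + m, out
```

Starting from the world ({0: 0, 1: 0}, [(1, 0, 2)]), run delivers three packets in turn and stops when no packet is left in transit. The successors function reflects the remark above that every delivery order is covered: a world with two distinct packets in flight has two one-step successors.
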
The contract of handleM ensures that executions of (message) preserve the
invariant $I$. Let $ok^I(\mathit{handleM})$ signify that the implementation of the handler
adheres to its contract, instantiated with the candidate invariant $I$. If $I$ holds in
the initial world then it is indeed inductive and holds in all reachable worlds:


Lemma 1. Let $w_0, w \in W$ and $I$ be a predicate such that $ok^I(\mathit{handleM})$. If
$w_0 \models I$ and $w_0 \leadsto^* w$ then $w \models I$.


A simplified version of the corresponding Why3-do model is shown in
Listing 4.1. The World module defines the tuple types packet and world and
auxiliary functions. Steps declares the following elements to be instantiated 
when cloning: the ok_Msg well-formedness predicate; initState and initMsgs,

module World
  type node  type state  type msg
  type packet = (node, node, msg)
  function dest (p:packet) : node = let (d,_,_)=p in d
  function src (p:packet) : node = let (_,s,_)=p in s
  function payload (p:packet) : msg = let (_,_,m)=p in m
  type world = (map node state, list packet)
  function localState (w:world) : map node state = let (lS,_)=w in lS
  function inFlightMsgs (w:world) : list packet = let (_,ifM)=w in ifM
end (* module World *)


module Steps
  predicate ok_Msg (node) (node) (msg)

  val function initState (node) : state
  val constant initMsgs : list packet
  constant initWorld : world = (initState, initMsgs)

  val ghost predicate indpred (w:world)
    ensures { w=initWorld -> result }
    ensures { result -> forall p: packet. mem p (inFlightMsgs w) ->
              ok_Msg (dest p) (src p) (payload p) }

  function step_message (w:world) (p:packet) (r:(state, list packet)) : world
  = let (st, ms) = r in let localState = set (localState w) (dest p) st in
    let inFlightMsgs = ms ++ (remove p (inFlightMsgs w)) in (localState, inFlightMsgs)

  val function handleMsg (h:node) (s:node) (m:msg) (sig:state) : (state, list packet)
    requires { ok_Msg h s m }
    ensures { forall w :world. indpred w -> mem (h, s, m) (inFlightMsgs w) ->
              sig = localState w h -> indpred (step_message w (h, s, m) result) }

  inductive step world world =
  | step_msg : forall w :world, p :packet. mem p (inFlightMsgs w) ->
      step w (step_message w p
        (handleMsg (dest p) (src p) (payload p) (localState w (dest p))))

  inductive step_TR world world = ...
  predicate reachable (w:world) = step_TR initWorld w

  lemma indpred_reachable : forall w :world. reachable w -> indpred w
end (* module Steps *)


Listing 4.1. Message-passing model: modelMP


used to construct initWorld; the inductive invariant indpred; and finally the 
handleMsg handler. The contract of indpred ensures that it is satisfied by 
the initial world, and that all messages in the network are well-formed. Well- 
formedness conditions are singled out from the invariant because the handler 
function may need to assume basic facts about messages. The module ends with 
lemma indpred_reachable, corresponding to Lemma 1 (the $ok^I(\mathit{handleM})$ and
$w_0 \models I$ premises are enforced by the contracts of indpred and handleMsg). It is
proved using a Why3 transformation for predicate induction, and SMT solvers. 


Example: Leader Election on a Ring. Leader Election is a coordination problem, 
where a set of processes or nodes collectively designate one of them to act as 
leader. One of the simplest solutions to this problem on a unidirectional ring 
network is the maximum-finding distributed algorithm devised by Chang and 
Roberts [7]. Let each node have a distinct identifier of some type equipped with
a total order relation. Informally the algorithm can be described as follows: (i) 
messages are node identifiers; each node starts by sending its id to the next node 
in the ring. (ii) Each node then enters a message-handling loop. If a received 
message has a higher value than the receiver’s id, the message is forwarded to 
the next node. Otherwise, it is discarded. (iii) If a node receives back a message 
with its own id, it claims to be the leader. The fundamental property to be 
proved of this system is that at most one node claims to be leader. The system 
has been used as example in [34] and later in [29]. The Ivy description of the 
system is based on the decidable EPR fragment of FOL (see Section 7), whereas
our formalization below uses unrestricted quantification. 
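
As an informal complement to the description above, the following Python sketch simulates the algorithm under an arbitrary delivery order and checks the at-most-one-leader property after every step. It is not part of Why3-do: the names run, handle_msg, and next_node, as well as the seeded random scheduler, are assumptions made purely for illustration, and a simulation of this kind only tests the property on particular runs rather than proving it.

```python
import random

def next_node(i, n):
    """Successor of node i on a unidirectional ring of n nodes."""
    return (i + 1) % n

def handle_msg(h, m, leader, ids, n):
    """Handler at node h for payload m; returns (new leader flag, sent packets)."""
    if m == ids[h]:
        return True, []                            # own id came back: claim leadership
    if m > ids[h]:
        return leader, [(next_node(h, n), h, m)]   # forward higher ids
    return leader, []                              # discard lower ids

def run(ids, seed=0):
    """Deliver in-flight packets in a random order until the network is empty."""
    rng = random.Random(seed)
    n = len(ids)
    leader = [False] * n
    # Packets are (dest, src, payload); every node starts by sending its own id.
    in_flight = [(next_node(i, n), i, ids[i]) for i in range(n)]
    while in_flight:
        dest, _src, m = in_flight.pop(rng.randrange(len(in_flight)))
        leader[dest], sent = handle_msg(dest, m, leader[dest], ids, n)
        in_flight.extend(sent)
        assert sum(leader) <= 1                    # at most one node claims leadership
    return leader
```

Calling run([7, 3, 9, 1, 5]) delivers the initial five packets in some order; only the packet carrying the highest id survives a full trip around the ring, so only its owner sets the leader flag.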


The Why3-do encoding of this algorithm is given in Listing 4.2, based on
the modelMP library model. The first step is to define types for nodes, identifiers,
states, and messages. Identifiers are uniquely associated to nodes by means of the 
id function and the uniqueIds axiom. The constant n_nodes is the number of 
nodes in the ring. A minimum of 3 nodes is assumed, with no upper bound. The 
constant maxId_global corresponds to the (unique) node having the highest- 
value id in the ring. Node states are records having a single field leader of 
Boolean type, which indicates when a node claims to be leader. The ok_Msg 
predicate describes the notion of well-formed message in the ring topology. 


The types for nodes and identifiers could be left undefined, with a set of 
axioms for the next function and the maxId_global constant. But in our expe- 
rience, using library types, as well as defined constants, predicates, and functions 
when adequate, is advantageous from the point of view of provability, and also re- 
duces the danger of introducing inconsistencies. For instance the maxId_global 
constant is defined algorithmically using a recursive let function maxId_fn with 
a “dual definition” (it is equipped with a contract describing precisely what it 
does). We could instead simply write an axiom concerning maxId_global, but 
using the dual definition let function, containing code, not only increases the 
degree of assurance in what is being specified, but also makes it easier to reason 
about, since Why3 will generate a more easily provable set of VCs. 


Cloning the module modelMP.World introduces new composed types and 
auxiliary definitions. The system description then proceeds to give the initial 
conditions of the system, by means of a state function initState, and a con- 
stant initMsgs for the list of messages that are sent upon booting, also defined 
by means of a recursive let function. The handler definition then follows. The 
next element in the module is the invariant indpred, defined as a let predicate 
(since logic elements like quantifiers and equality are required, it is defined as a 
let ghost predicate using an auxiliary predicate inv, see Section 2). It states 
that every inflight message is well-formed; it contains the id of some node in the 
ring, with value not less than the sender’s id, and it is not the id of any node 
i such that maxId_global is located between i and the message’s destination 
node (an auxiliary predicate between is used to express this). Moreover if the 
message contains its destination’s id then that id is the highest in the network. 
Finally, any node that is claiming to be the leader has the highest id in the ring. 
type node = int

val constant n_nodes : int 

axiom n_nodes_ax : 3 <= n_nodes 

let function next (x:node) : node = mod (x+1) n_nodes 


type id = int 
val function id (node) : id
axiom uniqueIds : forall i j :node. id i = id j <-> i=j 


let rec function maxId_fn (n:int) : node 
requires { 1 <= n <= n_nodes } 
ensures { 0 <= result < n} 
ensures { forall k :node. 0<=k<n -> k<>result -> id k < id result} 
variant { n } 
= if n=1 then 0 
else let m = maxId_fn (n-1) in if id (n-1) > id m then n-1 else m 


constant maxId_global : id = maxId_fn n_nodes 
type state = { leader : bool } 


type msg = id 
predicate ok_Msg (dest:node) (src:node) (m:msg) = 
0 <= dest < n_nodes /\ 0 <= src < n_nodes /\ dest = next src 


clone modelMP.World with type node = node, type state = state, type msg = msg
let function initState (i:node) : state = { leader = false } 


let rec function initMsgs_fn (n:node) : list packet 
requires { 0<=n<=n_nodes } 
ensures { forall s d :node, m :msg. mem (d, s, m) result -> 
m = id s /\ d = next s /\ n<=s<n_nodes /\
(forall i :node. between i maxId_global d -> m <> id i) /\ 
(m = id d -> d = maxId_global) } 
variant { n_nodes-n } 
= if (0<=n<n_nodes) then Cons (next n, n, id n) (initMsgs_fn (n+1)) 
else Nil 


let constant initMsgs : list packet = initMsgs_fn 0 


let function handleMsg (h:node) (src:node) (m:msg) (s:state) : (state, list packet) 
= if m = (id h) then ({ leader = true }, Nil) 
else if m > id h then (s, Cons (next h, h, m) Nil) 
else (s, Nil) 


predicate between (lo:node) (i:node) (hi:node) = 
(lo < i < hi) \/ (hi < lo < i) \/ (i < hi < lo)


lemma btw_next_lm : forall i j k :node. 
0 <= i < n_nodes -> 0 <= j < n_nodes -> 0 <= k < n_nodes -> i <> k -> 
between (next i) j k -> between i j k 


predicate inv (lS:map node state) (iFM:list packet) =
(forall s d :node, m :msg. mem (d, s, m) iFM ->
(ok_Msg d s m /\ m >= id s /\
(exists i :node. 0 <= i < n_nodes /\ m = id i) /\
(forall i :node. between i maxId_global d -> m <> id i) /\
(m = id d -> d = maxId_global) )) /\
(forall i:node. 0<=i<n_nodes -> (lS i).leader = true -> i = maxId_global)


let ghost predicate indpred (w:world) = inv (localState w) (inFlightMsgs w) 


clone modelMP.Steps with type node, type state, type msg, predicate ok_Msg, 
val initState, val initMsgs, val indpred, val handleMsg 


goal uniqueLeader : 
forall w :world, i j:node. 
reachable w -> 0<=i<n_nodes -> 0<=j<n_nodes ->
(localState w i).leader = true -> (localState w j).leader = true -> i = j 


Listing 4.2. Leader election on a ring (Chang-Roberts) 
The module then clones the Steps module from modelMP instantiating the
necessary elements, and formulates the uniqueLeader proof goal. The verification
results depend on the provers that are available. In our setup we were able to
prove automatically all VCs using the Alt-Ergo [1], CVC4 [5], and Vampire [36]
SMT solvers after (i) providing lemma btw_next_lm, proved automatically by
Alt-Ergo; and (ii) including in the postcondition of function initMsgs_fn the 
relevant facts relating in-transit messages and maxId_global, as required by 
the invariant. Observe that this postcondition is proved automatically by the 
program verification engine following the recursive definition of the function. 


5 Trace Specifications 


In the previous section we have considered a specification property expressed at 
the implementation level, with access to internal node states. Other internal ele- 
ments of worlds, including messages, could be mentioned in such implementation- 
level properties. It is however very useful to introduce an abstraction barrier be- 
tween specifications and implementation details. This can be achieved by logging 
certain observable events onto a trace of the system, and then writing specifi- 
cations as properties of the trace. Models in our setting can be equipped with 
traces, allowing for protocols and systems to be specified in this way. 

We will illustrate this by equipping the message-passing model of Section 4
with traces. Each system using this model defines an $Out$ type of outputs, and
the model defines external events as $Evt = N \times Out$, outputs paired with the
node that originated them (other models may use additional notions of external
event, such as inputs received by nodes from their local environments). A trace
is a sequence of external events; the function $rec : N \to Out^* \to Evt^*$ produces
a trace from a sequence of outputs, pairing them with the source node. Given a
predicate $\nu$ on traces and $\tau \in Evt^*$, we will write $\tau \models \nu$ when $\tau$ satisfies $\nu$.

A commit specification $(\mu_p, \mu_f)$ consists of a predicate $\mu_p(\Sigma, \Sigma)$ and a function
$\mu_f(\Sigma, \Sigma) : Out^*$, expressing respectively when outputs should be produced,
and what those outputs should be. The signature of the message handler is similar
to that in the model of Section 4, with a trace as additional output. Its
contract states that it complies with a given commit specification.


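$handleM(h : N,\ s : N,\ m : Msg,\ \sigma : \Sigma) : (\sigma' : \Sigma,\ nt' : Msg^*,\ l : Out^*)$
ensures $\forall lS : N \to \Sigma,\ nt : Msg^*.\ \sigma = lS(h) \to (h, s, m) \in nt \to$
        $I(lS, nt) \to I(lS[h \mapsto \sigma'],\ nt' + nt - \{(h, s, m)\})$
ensures $(\mu_p(\sigma, \sigma') \to l = \mu_f(\sigma, \sigma')) \land (\neg\mu_p(\sigma, \sigma') \to l = \epsilon)$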


We will write $ok^{I,\mu_p,\mu_f}(handleM)$ to signify that the implementation of handleM
adheres to its contract, with invariant $I$ and commit specification $(\mu_p, \mu_f)$.

Worlds are tuples $(lS, nt, \tau)$ with $lS : N \to \Sigma$, $nt : Msg^*$, and $\tau : Evt^*$. The
semantics will now be given by the relation $\leadsto\ \subseteq W \times N \times W$, with $w \leadsto_n w'$
meaning that world $w$ transitions to $w'$ with node $n$ executing a handler. The
following transition rule commits outputs to the trace:

$$\frac{handleM(h, s, m, lS(h)) = (\sigma, nt', l) \qquad (h, s, m) \in nt}{(lS, nt, \tau) \leadsto_h (lS[h \mapsto \sigma],\ nt' + nt - \{(h, s, m)\},\ rec_h(l) + \tau)}\ \text{(message)}$$


A specification is a triple $(\mu_p, \mu_f, \nu)$ consisting of a commit specification and
a predicate $\nu(Evt^*)$ expressing some notion of trace consistency. Correctness
implies that the commit specification is respected and traces are consistent.


Definition 1. A system with initial world $w_0 \in W$ is said to be correct with
respect to a specification $(\mu_p, \mu_f, \nu)$ if


1. for all $w = (lS, nt, \tau) \in W$, $w' = (lS', nt', \tau') \in W$ and $n \in N$ such that
$w_0 \leadsto^* w \leadsto_n w'$, if $\mu_p(lS(n), lS'(n))$ then $\tau' = rec_n(\mu_f(lS(n), lS'(n))) + \tau$,
otherwise $\tau' = \tau$;

2. $\tau \models \nu$ for every world $w = (lS, nt, \tau) \in W$ such that $w_0 \leadsto^* w$.


Lemma 2. Let $(\mu_p, \mu_f, \nu)$ be a specification, and $I$ a predicate such that
$ok^{I,\mu_p,\mu_f}(handleM)$, $w_0 \models I$, and for every world $w = (lS, nt, \tau)$, $w \models I$ implies
$\tau \models \nu$. Then the system is correct with respect to $(\mu_p, \mu_f, \nu)$.


As usual the lemma is proved mechanically in the Why3-do module for this 
model. Every Why3-do model extended with traces contains a similar lemma. 

A simplified version of the modelMPTrace model is shown in Listing 5.1 (ellipses
indicate elements that are preserved from the modelMP module). The world
type extends the tuple of modelMP with a trace of type list externalEvent.
The functions/predicates commitp, commitf, and consistent, corresponding
respectively to $\mu_p$, $\mu_f$, and $\nu$, are to be instantiated when cloning the model. The
indpred inductive predicate gains a new postcondition ensuring that it enforces
consistency of the system’s trace (following the conditions of Lemma 2). The
step inductive predicate is modified to include as an additional parameter the
node involved in each transition. The commit_step and consistent_reachable
lemmas (mechanically proved, using the contracts of indpred and handleMsg)
together correspond to Lemma 2 above.


Example: Distributed Lock. This example will show how Why3-do models can 
be extended in a flexible way. Its verification was first carried out in [20] and 
later also in and [29]. We adapt it here to make use of trace specifications, 
which will allow us to demonstrate their effectiveness as an abstraction barrier. 
In addition to traces, the example also illustrates the use of guarded actions in 
models (through the use of enabling predicates), as well as the use of a non- 
idealized network model, in which in-transit messages can be duplicated. Two 
implementations will be given: one that is in accordance with the trace spec if the 
idealized model is used, and a second implementation that tolerates duplicating 
messages. The specification of the distributed lock system is the following: 


1. the state of each node must include information on whether it is holding a 
lock (a Boolean), together with the lock’s current epoch (an integer); 
2. whenever a node acquires a lock it outputs its current epoch; 
module World

type externalEvent ...

type world = (map node state, list packet, list externalEvent) 
function trace (w:world) : list externalEvent = let (_,_,t)=w in t 
end (* module World *) 


module Steps 

type output 

type externalEvent 

val function record_outputs (n:node) (outs:list output) : list externalEvent 
predicate commitp (state) (state) 

function commitf (state) (state) : list output 

predicate consistent (t:list externalEvent) 


val ghost predicate indpred (w:world) 
ensures { ... /\ result -> consistent (trace w) } 


function step_message (w:world) (p:packet) (r:(state, list packet, list output)) : world = 
let (st, ms, outs) = r in let localState = set (localState w) (dest p) st in 
let inFlightMsgs = ms ++ (remove p (inFlightMsgs w)) in 
let trace = (record_outputs (dest p) outs) ++ (trace w) in 
(localState, inFlightMsgs, trace) 


val function handleMsg (h:node) (s:node) (m:msg) (sig:state) : (state, list packet, list output) 
requires { ... } 
ensures { ... /\ let (s',_,lo) = result in (commitp sig s' ->
lo = commitf sig s') /\ (not (commitp sig s') -> lo = Nil) }


inductive step world node world = 
| step_msg : forall w :world, p :packet. 
mem p (inFlightMsgs w) -> step w (dest p) (step_message w p 
(handleMsg (dest p) (src p) (payload p) (localState w (dest p)))) 


lemma commit_step : 
forall w w' :world, n :node. reachable w -> step w n w' ->
(commitp (localState w n) (localState w' n) ->
trace w' = (record_outputs n (commitf (localState w n) (localState w' n))) ++ trace w)
/\ (not (commitp (localState w n) (localState w' n)) -> trace w' = trace w)


lemma consistent_reachable : 
forall w :world. reachable w -> consistent (trace w) 
end (* module Steps *) 


Listing 5.1. Message-passing model: modelMPTrace


3. in every reachable world an output n is stored in position n of the trace. 


The system’s trace stores the sequence of outputs sent by different nodes. To- 
gether, these requirements mean that a node acquiring the lock at epoch n writes 
to position n of the trace, which implies (since traces are only modified by ap- 
pending at the head) that no two nodes acquire the lock in the same epoch. 
Specifications are written as Why3-do modules defining the output and 
externalEvent types, together with projection and the record_outputs func- 
tions. Most importantly, they define the commitp and consistent predicates, as 
well as the commitf function. However, the specification is abstract and does not 
impose the use of any specific system model. It requires the presence of certain 
types, but does not specify how the types are implemented. The requirement 
that states should contain specific information is included by declaring functions 
module Spec
(* to be instantiated when cloning this module *) 
type node 
type state 
function getEpochS (s:state) : int 
predicate getHeldS (s:state) 


type output = | Locked int 

function getEpochO (o:output) : int =
match o with | Locked e -> e end

type externalEvent = (node, output)

function node (e:externalEvent) : node = let (n,_) = e in n

function outp (e:externalEvent) : output = let (_,o) = e in o

let rec function record_outputs (n:node) (outs:list output) : list externalEvent 
ensures { forall i :int. 0<=i<length outs -> nth i result = (n, nth i outs) } 


predicate commitp (s:state) (s':state) = not (getHeldS s) /\ getHeldS s'
function commitf (_:state) (s':state) : list output = Cons (Locked (getEpochS s')) Nil
predicate consistent (t:list externalEvent) =
match t with
| Nil -> true
| Cons (_,o) tt -> getEpochO o = length t /\ consistent tt
end 
end (* module Spec *) 


Listing 5.2. Specification module for distributed lock 


and/or predicates on states. Implementation modules will define these types and 
functions and clone the specification module, instantiating them. 

This specification of the distributed lock is written as the Why3-do module 
of Listing 5.2. It assumes the use of a system model defining types node, state,
output, and externalEvent. The above requirements are formalized as follows: 


1. the functions getEpochS and getHeldS express required state information; 

2. the output type has a single constructor carrying an integer; externalEvents 
are outputs paired with nodes; the commitp predicate states that outputs 
are produced when the state of a node changes from not holding to holding a 
lock, and the commitf function returns a list with the node’s current epoch; 

3. the consistent predicate uses the list length function to require that the 
output stored in each position n of the trace contains epoch n. 
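
To make the three requirements concrete, here is a minimal Python sketch of the same specification. It is an illustration only, not the Why3-do module: states are modelled as dictionaries with held and epoch keys, outputs as ("Locked", e) pairs, and the helper commit (which prepends recorded outputs to the trace in the way the (message) rule does) is a hypothetical name introduced here.

```python
def commitp(s, s2):
    """Outputs are produced when a node goes from not holding to holding the lock."""
    return (not s["held"]) and s2["held"]

def commitf(s, s2):
    """The single output carries the epoch stored in the new state."""
    return [("Locked", s2["epoch"])]

def record_outputs(n, outs):
    """Pair every output with the node that produced it."""
    return [(n, o) for o in outs]

def consistent(trace):
    """The newest event is at the head; the event heading a suffix of
    length k must carry epoch k (this mirrors `getEpochO o = length t`)."""
    return all(out[1] == len(trace) - i for i, (_, out) in enumerate(trace))

def commit(trace, n, s, s2):
    """Extend the trace for a transition of node n from state s to state s2."""
    outs = commitf(s, s2) if commitp(s, s2) else []
    return record_outputs(n, outs) + trace
```

For instance, starting from the trace [(0, ("Locked", 1))], a transition in which node 1 acquires the lock at epoch 2 prepends (1, ("Locked", 2)) and keeps the trace consistent, whereas logging a second Locked 2 event on top of that would violate consistent because the head would then sit at position 3.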


We will consider two message-passing implementations for this specification 
based on a ring topology, shown in Listings 5.3 and 5.4. Node states are records
with two fields: a Boolean held indicating whether the node holds the lock, and
its current epoch. After the appropriate type definitions, both implementation
modules clone the same Spec module, and then the World module from the appropriate
model. The idealized model modelMPEnabledTrace is used in the implementation
of Listing 5.3, whereas Listing 5.4 uses modelMPEnabledTraceDupl,
in which messages can be duplicated. Both are extensions of modelMPTrace (Listing 5.1)
with an enabling predicate. Enabling predicates allow for nodes to execute
guarded actions: when cloning the model, the enabled predicate (with a
node and its state as parameters) and the handleEnbld function are instantiated;
the semantics states that the handler may be executed whenever the predicate 
type node = int

val constant n_nodes : int 

axiom n_nodes_ax : 2 <= n_nodes 

let function next (x:node) : node = mod (x+1) n_nodes 


type state = { held : bool; epoch : int } 
function getEpochS (s:state) : int = epoch s 
predicate getHeldS (s:state) = held s 


type msg = int 
predicate ok_Msg (dest:node) (src:node) (_:msg) = 
0<=dest<n_nodes /\ 0<=src<n_nodes /\ dest = next src


clone specLDT.Spec with type node, type state, function getEpochS, predicate getHeldS


clone modelMPEnabledTrace.World with type node, type state, 
type msg, type output, type externalEvent 


let function initState (n:node) : state
= let h = if n=0 then true else false in
let e = if n=0 then 1 else 0 in
{ held = h; epoch = e }
let constant initMsgs : list packet = Nil
let constant initTrace : list externalEvent = Cons (0,Locked(1)) Nil


let function handleMsg (_:node)(_:node) (m:msg) (s:state) :(state, list packet, list output)
= if (not (held s) ) then ({ held = True; epoch = m }, Nil, Cons (Locked m) Nil)
else (s, Nil, Nil)


let ghost predicate enabled (s:state) (i:node)
= 0<=i<n_nodes && held s


let function handleEnbld (h:node) (s:state) : (state, list packet, list output)
= let e = epoch s in ({ held = False; epoch = e }, Cons (next h, h, e+1) Nil, Nil)


let rec ghost predicate zeroHeld (lS:map node state) (m:int) = ...
let rec ghost predicate oneHeld (lS:map node state) (n:int) = ...
let rec ghost predicate oneMsg (lp:list packet) = length lp = 1
let rec ghost predicate noMsgs (lp:list packet) = length lp = 0


let rec ghost predicate ok_trace (t:list externalEvent)
ensures { result -> consistent t }
= match t with
| Nil -> true
| Cons (_,o) Nil -> getEpochO o = 1
| Cons (_,o1) os ->
match os with
| Nil -> true
| Cons (_,o2) _ -> getEpochO o1=(getEpochO o2)+1 && ok_trace os
end
end


predicate inv (lS:map node state) (iFM:list packet)
(tr:list externalEvent)
= (forall p: packet. mem p iFM -> ok_Msg(dest p)(src p) (payload p))
/\ ((oneMsg iFM /\ zeroHeld lS n_nodes)
\/ (noMsgs iFM /\ oneHeld lS n_nodes))
/\ (forall n :node. 0<=n<n_nodes -> held (lS n) ->
n = node (hd tr) /\ epoch (lS n) = getEpochO(outp (hd tr)))
/\ (forall p: packet. mem p iFM ->
src p = node (hd tr) /\ payload p=getEpochO(outp (hd tr))+1)
/\ length tr > 0 /\ ok_trace tr


let ghost predicate indpred (w:world) 
= inv (localState w) (inFlightMsgs w) (trace w) 


clone modelMPEnabledTrace.Steps with ... 


Listing 5.3. Distributed lock with idealized model 
let function handleMsg (_:node) (_:node) (m:msg) (s:state)
: (s':state, lp:list packet, lo:list output)
= let nop = (s, Nil, Nil) in 
if (held s) || m <= epoch s then nop 
else ({ held = True; epoch = m }, Nil, Cons (Locked m) Nil) 


(* helper definitions for invariant predicate *) 

let rec ghost predicate zeroHeld (lS:map node state) (n:int) ...

let rec ghost predicate atMostOneHeld (lS:map node state) (n:int)...
let rec ghost predicate isFresh (p: packet) (lS:map node state)...
let rec ghost predicate allStale (lS:...) (lp:list packet)...

let rec ghost predicate atMostOneFresh (lS:...)(lp:...)...

let rec ghost predicate ok_trace (t:list externalEvent)...


predicate inv (lS:map node state) (iFM:list packet)
(tr:list externalEvent)
= (forall p: packet. mem p iFM -> ok_Msg (dest p)(src p) (payload p))
/\ atMostOneFresh lS iFM /\ atMostOneHeld lS n_nodes
/\ (zeroHeld lS n_nodes \/ allStale lS iFM)
/\ (forall n :node. 0<=n<n_nodes -> held (lS n) ->
n = node (hd tr) /\ epoch (lS n) = getEpochO(outp (hd tr)))
/\ (forall p: packet. mem p iFM -> isFresh p lS ->
src p = node (hd tr) /\ payload p = getEpochO(outp (hd tr))+1)
/\ length tr > 0 /\ ok_trace tr


Listing 5.4. Distributed lock with duplicating messages model 


is true. In the present example, enabled is defined as true when a node holds a 
lock, in which case it is free to release it. The lock is released when handleEnbld 
executes, sending a message to the next node in the ring. The message includes 
the value of the sender’s current epoch, incremented by one. 

The system is initialized with node 0 holding the lock (and this fact is reg- 
istered in the system trace). The handling functions then follow. The enabling 
predicate and the corresponding handler are the same in both implementations; 
it is in the message handlers that they differ. With the idealized model nodes 
can trust that messages are never stale, so they react by blindly acquiring the 
lock. With the duplicating model the receiving node first checks whether the 
epoch in the received message is higher than its present epoch (in which case 
it cannot be a stale copy of a previous message). The inductive invariants are 
also different for both implementations, but both include a property expressed 
with the ok_trace predicate, stating that events in the trace contain incremen- 
tal epochs, starting from 1. This implies consistency of the trace (as defined in 
the specification), and is easier to check for inductiveness. 
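
The difference between the two message handlers can be observed on a small hand-built run. The Python sketch below is an assumption-laden illustration rather than a rendering of the Why3-do models: handle_ideal and handle_dup transcribe the two handlers, while scenario is a hypothetical driver that replays one duplicated packet on a two-node ring and returns the resulting trace.

```python
def handle_ideal(m, s):
    """Handler for the idealized model: acquire blindly when not holding."""
    if not s["held"]:
        return {"held": True, "epoch": m}, [("Locked", m)]
    return s, []

def handle_dup(m, s):
    """Duplication-tolerant handler: ignore messages that are not fresh."""
    if s["held"] or m <= s["epoch"]:
        return s, []
    return {"held": True, "epoch": m}, [("Locked", m)]

def handle_enabled(h, s, n):
    """Release the lock and send the next epoch to the next node in the ring."""
    e = s["epoch"]
    return {"held": False, "epoch": e}, [((h + 1) % n, h, e + 1)]

def consistent(trace):
    """The event heading a suffix of length k must carry epoch k."""
    return all(out[1] == len(trace) - i for i, (_, out) in enumerate(trace))

def scenario(handle_msg, n=2):
    """Node 0 passes the lock to node 1; the packet is later delivered twice."""
    states = [{"held": i == 0, "epoch": 1 if i == 0 else 0} for i in range(n)]
    trace = [(0, ("Locked", 1))]

    def deliver(packet):
        nonlocal trace
        dest, _src, m = packet
        states[dest], outs = handle_msg(m, states[dest])
        trace = [(dest, o) for o in outs] + trace

    def release(h):
        states[h], sent = handle_enabled(h, states[h], n)
        return sent[0]

    p1 = release(0)   # node 0 releases: packet (1, 0, 2)
    deliver(p1)       # node 1 acquires the lock at epoch 2
    release(1)        # node 1 releases (its packet stays in flight)
    deliver(p1)       # the network delivers a stale duplicate of p1
    return trace
```

With handle_dup the stale copy is ignored and the trace stays consistent; with handle_ideal node 1 acquires the lock a second time at epoch 2, so a second Locked 2 event is logged and consistent fails on the resulting trace.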

Let us consider in detail the system of Listing 5.4. A message is fresh if
the current epoch of its destination node is lower than the epoch carried by the message. Transfer
messages are always sent from the highest epoch node (holding the lock) and 
thus, at the time of sending, the destination has a lower epoch, which will be 
updated when the message is received and the lock acquired. Other copies of the 
message are stale because their destinations’ epochs have since increased. The 
system’s invariant is given as the conjunction of the following properties, using 
the zeroHeld, atMostOneHeld, allStale, and atMostOneFresh predicates: (i) 
in-transit messages are well-formed; (ii) there is at most one in-transit fresh
message, and at most one node holding a lock; if a node holds a lock then all
in-transit messages are stale; (iii) if node n holds the lock then the last Locked x
was written in the trace by n, and x is the current epoch of n; (iv) if there exists a
fresh in-transit message, then it was sent by the last node that output Locked x,
and it carries the value x + 1; (v) the trace obeys the ok_trace predicate.

The VCs generated for the modules of Listings 5.3 and 5.4, proved automatically,
establish the correctness of each system with respect to the specification of
Listing 5.2: events are being logged in the specified way, and traces are consistent.


6 Locally Shared Memory Model 


Dijkstra described certain distributed systems (including the self-stabilizing sys- 
tems described below) using a guarded processes model, in which nodes/pro- 
cesses do not exchange messages, but instead have direct read access to each 
other’s states. Although particular systems will only require read access to a 
limited set of states (typically its immediate neighbors’), our model allows read 
access universally. This is not a shared-memory model in all generality, but it 
may be implemented over shared memory, with a single-writer multiple-reader 
data structure for each node’s state (and readers—writer locks for atomicity). 
We formalize this in our setting as a model where worlds are simply of the
form $(lS)$ with $lS : N \to \Sigma$ a state-assigning function. A system based on this
model is programmed by defining an enabling predicate on nodes and a handling
function describing the behavior that can be executed whenever a node
is enabled. Formally we will consider that the enabling predicate has signature
$ep(n : N,\ lS : N \to \Sigma)$, taking as parameters a node and a global state-assigning
function, and the handling function has the following signature and contract:


$handleE(h : N,\ lS : N \to \Sigma) : (\sigma : \Sigma)$
requires $ep(h, lS) \land I(lS)$
ensures $I(lS[h \mapsto \sigma])$


The enabling predicate and the handler code have read access to every node’s 
state, but the handler may only modify the state of the node where it is running. 
This semantics is given by the following rule: 

$$\frac{handleE(h, lS) = \sigma \qquad ep(h, lS)}{(lS) \leadsto_h (lS[h \mapsto \sigma])}\ \text{(enabled)}$$


where $\leadsto_h$ means that node $h$ runs the handler. The contract of handleE ensures
that executions of the (enabled) transition rule preserve the property $I$ (the
contract ensures this if the node is enabled, and the semantics only allow for
transitions satisfying this requirement). We will write $ok^I(ep, handleE)$ when the
implementation of the handling function handleE adheres to its contract, with
invariant $I$ and enabling predicate $ep$. Listing 6.1 shows a simplified version of
the Why3-do modelReadallEnabled module, including the following lemma,
proved using an induction transformation and SMT solvers.
module World
type node
type state
type world = map node state
end 


module Steps 
val predicate validNd (n:node) 
val function initState (node) : state 
constant initWorld : world = initState 


val ghost predicate indpred (w:world) 

ensures { w=initWorld -> result } 

val ghost predicate enabled (map node state) (i:node) 
requires { validNd i } 


function step_enbld (w:world) (n:node) (st:state) : world = set w n st


val function handleEnbld (h:node) (lS:map node state) : state
requires { validNd h /\ enabled lS h /\ indpred lS }
ensures { indpred (step_enbld lS h result) }


inductive step world node world = 
| step_enbld : forall w :world, n :node. validNd n -> enabled w n -> 
step w n (step_enbld w n (handleEnbld n w)) 


lemma indpred_step : 

forall w w' :world, n :node. step w n w' -> indpred w -> indpred w'
lemma step_preserves_states :

forall w w' :world, n i :node. step w n w' -> i<>n -> w' i = w i


(* keeps track of number of transition steps *) 
inductive step_TR world world int = 
| base : forall w :world. step_TR w w 0 
| step : forall w w' w'' :world, n :node, steps :int.
step_TR w w' steps -> step w' n w'' -> step_TR w w'' (steps+1)


lemma noNeg_step_TR : forall w w' :world, steps :int. step_TR w w' steps -> steps >= 0
lemma indpred_manySteps :
forall w w' :world, steps :int . step_TR w w' steps -> indpred w -> indpred w'


predicate reachable (w:world) = exists steps :int. step_TR initWorld w steps 
lemma indpred_reachable : forall w :world. reachable w -> indpred w 
end 


Listing 6.1. Locally shared memory model: modelReadallEnabled 


Lemma 3. Let $w_0, w \in W$, with $ep$ and $I$ predicates such that $ok^I(ep, handleE)$,
$w_0 \models I$, and $w_0 \leadsto^* w$. Then $w \models I$.


Example: Stabilizing Mutual Exclusion. Self-stabilizing systems [15,38] are designed
to tolerate failures resulting from “horrible errors” (such as data corruption),
by including a recovery mechanism. Given some notion of legal configuration,
a system is said to be self-stabilizing if (i) starting from an illegal
configuration, all executions eventually converge to a legal configuration (a liveness
property), and (ii) legal configurations are closed under normal execution
steps, i.e. no illegal configuration is reachable if no corruption of data occurs 
(a safety property). One of Dijkstra’s examples of such a system in his seminal 
paper was a directed ring of processes sharing a resource, with mutual exclu- 
sion enforced by means of a circulating token. Legal configurations are those in 
module SelfStab_Ring_Closure
type node = int
val constant n_nodes : int
axiom n_nodes_bounds : 2 < n_nodes
let predicate validNd (n:node) = 0 <= n < n_nodes
type state = int
val constant k_states : int
axiom k_states_lower_bound : n_nodes < k_states
let function incre (x:state) : state = mod (x+1) k_states


clone modelReadallEnabled.World with type node, type state 

let function initState (n:node) : state = if n=n_nodes-1 then 1 else 0

predicate has_token (lS:map node state) (i:node) =
(i = 0 /\ lS i = lS (n_nodes-1)) \/ (i > 0 /\ i < n_nodes /\ lS i <> lS (i-1))


let ghost predicate enabled (lS:map node state) (i:node) = has_token lS i


let function handleEnbld (h:node) (lS:map node state) : state
= if h = 0 then incre (lS (n_nodes-1)) else lS (h-1)


let rec ghost predicate atLeastOneToken (lS:map node state) (n:int)
requires { validNd n }
ensures { result <-> exists k :int. 0<=k<n /\ has_token lS k }
variant { n }
= n > 0 && (has_token lS (n-1) || atLeastOneToken lS (n-1))


predicate atMostOneToken (lS:map node state) (n:int) = validNd n ->
forall i j :int. 0<=i<n -> 0<=j<n -> has_token lS i -> has_token lS j -> i=j


lemma first_last : forall n: int, lS :map node state.
n >= 0 -> (forall j :int. 0<j<=n -> lS j = lS (j-1)) -> lS 0 = lS n
lemma atLeastOneTokenLm : forall w :world. atLeastOneToken w n_nodes


predicate inv (lS:map node state) =
(forall n :int. validNd n -> 0 <= lS n < k_states) /\ atMostOneToken lS n_nodes
let ghost predicate indpred (w:world) = inv w 


clone modelReadallEnabled.Steps with type node, type state, 
val validNd, val initState, val indpred, val enabled, val handleEnbld 


predicate oneToken (w:world) = atMostOneToken w n_nodes /\ atLeastOneToken w n_nodes 
goal oneToken : forall w :world. reachable w -> oneToken w 
end 


Listing 6.2. Self-stabilizing mutual exclusion on a ring — Closure 


which exactly one process carries a token. In case of failure the system converges 
back into a single-token configuration. Dijkstra’s proposal for self-stabilizing mutual
exclusion was the following: processes have integer numbers in $\{0, \ldots, K-1\}$
as states, with K greater than the size of the ring. Each process observes the 
state of its predecessor in the ring; the process with index 0 holds a token when 
its state is the same as that of its predecessor (the last process in the ring); 
other processes hold a token when their state is different from their predeces- 
sor’s. When holding a token, each process may modify its state by copying its 
predecessor’s state; node 0 additionally increments (modulo K) this state. 
Listing 6.2 shows the Why3-do formalization of this system, based on the
locally shared memory model. Nodes and states are both integers; n_nodes and
k_states are the size of the ring and the number of different states. The enabling
predicate is defined as true for a node exactly when it is carrying a token,
as specified by the has_token predicate. The handler defined by handleEnbld 
copies states as previously described. Mutual exclusion is expressed using pred- 
icates atLeastOneToken and atMostOneToken that apply to the first n nodes. 
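
The protocol rules described above can be simulated directly. The following Python sketch is illustrative only and is not part of the Why3-do development: the names has_token and handle mirror those used in the listing, while the list-based encoding of configurations and the step and tokens helpers are assumptions made for this example.

```python
from typing import List


def has_token(states: List[int], i: int) -> bool:
    """Node 0 holds a token when its state equals its predecessor's (the last
    node); any other node holds one when its state differs from its predecessor's."""
    pred = states[(i - 1) % len(states)]
    return states[i] == pred if i == 0 else states[i] != pred


def handle(states: List[int], i: int, k_states: int) -> int:
    """New state of an enabled node: copy the predecessor's state;
    node 0 additionally increments it modulo k_states."""
    pred = states[(i - 1) % len(states)]
    return (pred + 1) % k_states if i == 0 else pred


def step(states: List[int], i: int, k_states: int) -> List[int]:
    """Fire node i if it is enabled; otherwise return the configuration unchanged."""
    new = list(states)
    if has_token(states, i):
        new[i] = handle(states, i, k_states)
    return new


def tokens(states: List[int]) -> List[int]:
    """Indices of all nodes currently holding a token."""
    return [i for i in range(len(states)) if has_token(states, i)]


# A legal configuration: all states are equal, so only node 0 holds the token.
K = 5                      # number of states, greater than the ring size (4)
config = [0, 0, 0, 0]
trace = []
for _ in range(8):
    (holder,) = tokens(config)   # unpacking fails unless there is exactly one token
    trace.append(holder)
    config = step(config, holder, K)
# trace == [0, 1, 2, 3, 0, 1, 2, 3]: the token travels around the ring
```

Starting from a legal configuration, the single token is passed from each node to its successor, which is the closure behaviour that the oneToken goal captures.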

The module of Listing 6.2 verifies the closure property. The invariant expresses
that node states are within bounds, and there is no more than one token
in the ring. One possible (legal) initial configuration of the system is described 
by the initState let function. These definitions are instantiated when cloning 
modelReadallEnabled. The module ends with the oneToken goal, stating that 
there exists exactly one token in all reachable configurations. 


Stepwise Bounded Validation. In the verification of closure we use the following 
technique: we introduce an axiom bounding the size of the system, passed to 
the solvers to make automated proofs easier (soundness of the verification may 
be compromised at this point). We then introduce parts of the invariant step 
by step, and check them in this bounded system in order to gain insight as to 
their validity. Once we feel confident about the selected invariant, we remove
the bounding axiom to achieve soundness of the verification, possibly stating 
additional lemmas or strengthening the invariant. For the present system: 


1. We started with the following invariant. Inductiveness is proved automatically, 
but the oneToken goal cannot be proved from it (as expected): 

forall i :int. validNd i -> 0 <= lS i < k_states.

2. Next, we included atMostOneToken lS n_nodes in the invariant; preservation
was proved automatically, but oneToken could still not be proved. We then added
a bounding axiom n_nodes <= 10, which allowed the goal to be proved.

3. We strengthened the invariant with atLeastOneToken lS n_nodes and removed
the bounding axiom. The oneToken goal was proved trivially; however, the VC 
pertaining to the preservation of the invariant could not be proved. 

4. Preservation could be proved by reintroducing a bound on n_nodes (with a 
bound of 1000, all VCs could be proved within 30 seconds in our setup). 

These bounded proof results indicate that, in all likelihood, (i) the property
atLeastOneToken lS n_nodes is preserved by system transitions, and thus inductive,
but (ii) it is not necessary to include it in the inductive invariant to prove
oneToken: in our development the oneToken goal could be proved for a number
of processes up to 10 without including the former property in the invariant. The
reason for this is that in fact the atLeastOneToken lS n_nodes property is satisfied
by definition in all configurations: a token is present whenever either
some two adjacent processes have different states, or the first and last processes
have the same state. If all processes have the same state, then the second case
holds. Including the property in the invariant still requires a bound (to prove
preservation), but this can now have a much higher value (1000 rather than 10).
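
The two observations above can be cross-checked by brute force on small rings. The sketch below is a plain Python enumeration under assumed bounds (rings of 2 to 5 nodes, with K one larger than the ring size); it plays a role loosely analogous to the bounding axiom and, like it, gives confidence rather than an unbounded proof. The function names are hypothetical.

```python
from itertools import product


def has_token(states, i):
    # states[i - 1] wraps around to the last node when i == 0
    pred = states[i - 1]
    return states[i] == pred if i == 0 else states[i] != pred


def token_count(states):
    return sum(has_token(states, i) for i in range(len(states)))


def successors(states, k_states):
    """All configurations reachable in one step by firing one enabled node."""
    for i in range(len(states)):
        if has_token(states, i):
            new = list(states)
            new[i] = (states[i - 1] + 1) % k_states if i == 0 else states[i - 1]
            yield tuple(new)


def check_bounded(max_nodes):
    """Exhaustively check every configuration of every ring of 2..max_nodes nodes."""
    checked = 0
    for n in range(2, max_nodes + 1):
        k = n + 1  # K greater than the size of the ring
        for states in product(range(k), repeat=n):
            # (a) at least one token exists in every configuration, legal or not
            assert token_count(states) >= 1
            # (b) closure: a single-token configuration only steps to single-token ones
            if token_count(states) == 1:
                assert all(token_count(s) == 1 for s in successors(states, k))
            checked += 1
    return checked


configurations_checked = check_bounded(5)
```

Check (a) needs no reachability assumption, which matches the remark that the property holds by definition in all configurations.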

An unbounded proof is obtained by including in the module the first_last
lemma (proved by induction on n). This allows for the goal to be proved automatically
without atLeastOneToken lS n_nodes in the invariant, and with
no upper bound on n_nodes. We remark that the dual definition (recursive +
contract) of the atLeastOneToken let function was crucial for proving the goal
automatically (this was not possible with a logic definition).

                             TLAPS  Verdi  IronFleet  Ivy  Why3-do
Contract-based design        ✓ (partial)  ✓
DS models                    generic  MP  MP  MP  MP; LSM
Reusable models              ✓  ✓  ✓
Different fault models       ✓  ✓
Verified system transforms   ✓
Abstract specifications      state machines, spec to protocol refinement | observ. traces | state machines, spec to protocol refinement | observ. traces (model-independent)
Liveness properties          ✓ (TLC)  ✓ (TLC)
Logic                        TLA+  FOL  FOL  EPR  FOL
Invar. discovery support     ✓
Automated provers            multiple  Z3  Z3  multiple
Proof assistants             multiple  Coq  multiple
Programming language         PlusCal | Gallina (F) | state machines; Dafny (F/I) | RML | WhyML (F/I)
Implementation support       UDP model/machine types | mutable/machine types (WhyML)
Generation of executables    ✓  ✓
Table 7.1. Comparison of DS deductive verification frameworks
MP: message-passing, LSM: locally shared memory, F: functional, I: imperative

The convergence property is more challenging; its Why3-do formalization 
can be found in the artifact [28]. We have also verified Dijkstra’s version of this 
system with a bidirectional array topology. Bounded exploration again allowed 
us to validate parts of the invariant; attaining an unbounded verification required 
strengthening the invariant, rather than a lemma. 


7 Related Work 


Deductive verification methods are typically based on first-order logic reasoning 
and focus on safety properties, with correctness proofs requiring users to man- 
ually provide appropriate invariants and to discharge (either automatically or 
interactively) proof obligations generated in the process. Invariants may apply 
to loops, recursive functions, or non-deterministic transition relations, and al- 
low for correctness proofs by induction on the length of executions. In the last 
few years a number of frameworks and tools have been proposed for reasoning 
about asynchronous message-passing systems using inductive invariants, based 
on atomic handler models and different specification mechanisms. We will now 
briefly survey these and compare them with Why3-do in terms of design choices. 

Verdi [42] introduced the use of models based on worlds and atomic handlers, 
with models capturing different fault semantics. Why3-do’s semantic framework 
is inspired by Verdi; we enrich handlers with interface specifications in the form 
of contracts, allowing for the use of methods that are standard in deductive 
verification of single-thread software. Verdi is a Coq development, and reasoning 
is carried out within the Coq proof assistant [22]. The implementation of our
framework as a Why3 library allows for the use of automated tools (all the proofs
in this paper use SMT solvers and a few Why3 transformations). 


Whereas Verdi handlers are defined in a purely functional style, in Why3- 
do they are written in WhyML, combining functional and imperative features. 
Verdi supports system transformations that allow for verified systems to be ob- 
tained from systems verified with simpler models (additional mechanisms may 
be automatically introduced to compensate for the presence of faults). Trans- 
formations are verified once and for all, so the resulting systems do not need to 
be verified. An important difference is that Verdi targets exclusively message- 
passing systems, whereas Why3-do covers different system models. Verdi sup- 
ports traces, but specifications may not be written in a completely abstract, 
model-independent way. In Why3-do this is achieved through the use of clonable 
specification modules defining commit specifications and trace consistency. 


The IronFleet [20] platform is built on top of a deductive verification tool,
Dafny [26], which uses the Z3 [31] SMT solver for proofs. Like Verdi, it supports
only message-passing systems. A major difference with respect to Why3-do and
Verdi is that, instead of a specification mechanism based on traces, IronFleet
separates development into a specification level (where worlds are viewed abstractly)
and a concrete protocol level, both described in FOL as state machines. A
refinement function [1] maps protocol worlds to the specification level, and a
refinement proof shows that protocol steps are compatible with the abstract behavior
(in Why3-do this is achieved by trace consistency proofs). There is a third,
implementation level, where event handlers are programmed using mutable data
structures and machine types, for performance and realism. IronFleet extends
Dafny with a UDP specification to support networking, which allows non-atomic
handlers to be developed assuming low-level interleaving. In order to establish
refinement proofs between low-level implementations and protocols,
reduction-based reasoning is supported. IronFleet also includes an embedding of TLA that
makes it possible to reason about liveness properties. It is overall an ambitious
tool that has been used by its authors to verify practical systems.


Up to a point Why3-do implementations cover both the protocol and imple- 
mentation levels, since WhyML accommodates both functional programs and 
stateful code with mutable structures and machine types. Why3 supports code 
extraction from verified WhyML programs, and it should not be difficult to ob- 
tain a distributed implementation from a verified Why3-do system, using one of 
the available OCaml libraries. Our framework allows for diverse system models, 
with different implementation infrastructure requirements. In general each node 
must run a scheduler that will, for instance, receive incoming local inputs and 
messages from the network, check enabling predicates, and run the appropriate 
handlers, reflecting locally and globally the effects prescribed by the semantics. 


The Ivy tool [34] differs from Why3-do and the previous frameworks in sev- 
eral important ways. It uses a dedicated modeling/programming language called 
RML, and a logic language restricted to the effectively propositional (EPR) class 
of formulas, whose satisfiability is decidable (Ivy also uses Z3). Specifications 
may refer to any part of the model (no distinct specification/protocol layers or
observation traces are used). The use of EPR imposes severe restrictions: RML
does not allow arithmetic operations, so for instance a ring topology cannot be 
modeled using integer modulo arithmetic. A verification methodology based on 
the use of EPR, and details on how it has been used to verify variants of the 
PAXOS protocol, are extensively described in (the method proposed for re- 
ducing quantifier alternation is of general interest, even when unrestricted FOL 
is used). Leveraging the decidability of the logic, Ivy focuses on assisting the 
user in writing the protocol and its specification, and in discovering adequate 
inductive invariants. A few initial steps of execution are first considered, which 
may allow for bugs to be found in the protocol and/or target properties; Ivy 
then assists the user in finding an inductive invariant by performing interactive 
strengthening and generalization steps, and representing states visually. 


A more general, comprehensive framework for reasoning about distributed 
systems has been constructed around the TLA+ specification language, based 
on the Temporal Logic of Actions [25]. TLA+ is without any doubt a widely suc- 
cessful toolset, and its adoption in practice is well documented [32]. The toolset 
comprises the specification language itself; the PlusCal algorithmic language; the 
TLC model checker [43]; the TLAPS proof system [8]; and a development environment.
Correctness proofs are based on the notion of refinement mapping [1]. If
one writes a TLA+ specification and a PlusCal implementation, and then trans- 
lates the latter to TLA+, its correctness can be stated as a refinement problem, 
whose VC is itself written as a TLA+ formula. The TLAPS proof system is an 
ongoing effort but can already be used to prove many such refinements. TLAPS 
proofs [12] are constructed using both proof assistants and SMT solvers. 


Table 7.1 summarizes the distinctive aspects of the discussed tools. Additionally,
the I4 technique has been proposed based on the automatic synthesis
(by model checking) of inductive invariants for small instances of protocols, fol- 
lowed by their generalization. Invariants are checked with Ivy, and if necessary 
the process is repeated, considering a bigger instance or a pruned invariant. 
Kaizen [23] is a verified blockchain system that has been developed using an 
approach similar to IronFleet. Implementations of distributed systems that have 
been formally verified using different tools have been empirically scrutinized 
in [19]. 

Program logics for distributed systems have also been the subject of recent 
work, typically based on or inspired by concurrent separation logics [6], and 
mechanized in the Coq proof assistant. Notable examples include DISEL [39],
which focuses on modularity and compositionality, and Aneris [24], which in- 
cludes support for node-level concurrency in addition to inter-node reasoning. 
ModP [14] is an actor-based compositional programming framework that offers 
assume-guarantee reasoning principles to support compositional system testing. 


The self-stabilizing ring system has been verified interactively using the 
PVS [35] and Isabelle [30] proof assistants, and also by symbolic model checking
[41,9]. A general framework for building certified proofs of self-stabilizing
algorithms (using Coq) is described in [8].


8 Conclusion


In this paper we have proposed principles for contract-based verification of dis- 
tributed systems, based on a library promoting modular development. The ap- 
proach enables the use of state of the art sequential software verifiers for reason- 
ing about distributed systems, supports model-independent trace specifications, 
and is uniform across system models, beyond the message-passing setting. 

To implement these principles we have chosen the Why3 verification platform.
We have shown how specific features of Why3, such as the ability to interface 
with different solvers and the use of dual definitions, contribute to successful 
automated proofs. For instance, we were able to prove the inductiveness of an 
invariant for the leader election protocol containing a quantifier ‘alternation’ (a 
sequence of the form ∀∃ [33], outside the decidable EPR logic). In particular,
the Alt-Ergo and Vampire solvers were able to prove these VCs, whereas Z3 
and CVC4 failed (with a generous timeout value). On the other hand, the dual 
definition of the atLeastOneToken predicate in the self-stabilization systems, 
when the invariant included this predicate containing an existential quantifier, 
allowed Z3 or CVC4 (not the other solvers) to prove inductiveness. In neither 
case was it necessary to employ invariant quantifier hiding, as in [20]. 

Unbounded domains (nodes, messages, etc.) are typical of distributed sys- 
tems. Considering bounded systems, in combination with dual definitions, al- 
lowed us to explore the inductiveness of invariant properties before tackling the 
unbounded case (by strengthening invariants or writing lemmas). This should 
not be confused with the use of bounded verification in Ivy, which considers the
first few system steps in order to debug models, or in I4, which produces finite 
quantifier-free instances of problems, amenable to model checking. 

The limitations of the framework are that, in the spirit of verification of se- 
quential programs with Why3, Why3-do targets the verification of distributed 
systems at the algorithmic level, and is not intended for reasoning about exe- 
cutable implementations (but see the discussion on implementation extraction in 
Section 7). Also, no support for reasoning with non-atomic handlers is included.

Why3 is a stable tool, actively developed by a solid team, with a growing
user community and very low risk of obsolescence. It is being successfully used 
for formal verification in contexts as diverse as safety-critical programming [2], 
multicore schedulers [27], or blockchain smart contracts [87/40]. Why3-do brings 
Why3’s strengths in terms of usability and proof engineering to the mechanical 
verification of distributed systems, making it available to a wider community.


Abstract. Virtual memory is an essential mechanism for enforcing security
boundaries, but its relaxed-memory concurrency semantics has
not previously been investigated in detail. The concurrent systems code 
managing virtual memory has been left on an entirely informal basis, 
and OS and hypervisor verification has had to make major simplifying 
assumptions. 

We explore the design space for relaxed virtual memory semantics 
in the Armv8-A architecture, to support future system-software verifica- 
tion. We identify many design questions, in discussion with Arm; develop 
a test suite, including use cases from the pKVM production hypervisor
under development by Google; delimit the design space with axiomatic- 
style concurrency models; prove that under simple stable configurations 
our architectural model collapses to previous “user” models; develop tool- 
ing to compute allowed behaviours in the model integrated with the full 
Armv8-A ISA semantics; and develop a hardware test harness. 

This lays out some of the main issues in relaxed virtual memory,
bringing these security-critical systems phenomena into the domain of 
programming-language semantics and verification with foundational ar- 
chitecture semantics. 


1 Introduction 


Computing relies on virtual memory to enforce security boundaries: hypervisors 
and operating systems manage mappings from virtual to physical addresses to 
restrict access to physical memory and memory-mapped devices, and thereby to 
ensure that processes and virtual machines cannot interfere with each other, or 
with the parent OS or hypervisor. In a world with endemic use of memory-unsafe 
languages for critical infrastructure, and of hardware that does not enforce fine- 
grained protection, virtual memory is one of the few mechanisms one has to 
enforce strong security guarantees. This has driven interest in hypervisors and 
virtual machines, and it provides a compelling motivation for verification of the 
OS-kernel and hypervisor code that manages virtual memory to provide security. 

However, any such verification requires a semantics for the protection mech- 
anisms provided by the underlying hardware architecture. There are two major 
challenges in establishing such a semantics. First, there is its sequential intricacy:
virtual memory is one of the most complex aspects of a modern general-purpose
architecture. For 64-bit Armv8-A (AArch64) it is described in a 166-page chap- 
ter of the prose reference manual [13, Ch.D5] and includes a host of features and 
options. Second, and more fundamentally, there is its relaxed-memory behaviour.
Hardware implementations of virtual memory use in-memory representations of 
the virtual-to-physical address mappings, represented as hierarchical page tables. 
For performance, there are dedicated cache structures for commonly used map- 
ping data, in Translation Lookaside Buffers (TLBs). Translations are used often 
—a single load instruction might need 40 or more page-table entries to translate 
its fetch and access addresses — but they are changed only rarely, and by systems 
code not user code. Architectures therefore require manual management of TLB 
caching, e.g. with specific instructions to invalidate old TLB entries that should 
no longer be used, instead of providing the simpler coherent memory abstrac- 
tion that they do for normal accesses. All this gives rise to new relaxed-memory 
effects, with subtle constraints determining when translations are required or 
forbidden to read from specific writes to the page tables, and systems code has 
to handle these appropriately to provide the desired virtual-memory abstraction 
and its security properties. 

Previous work has developed hand-written sequential semantics for some as- 
pects of address translation in Arm [57,59,58,60,44,38,41] and x86 [34,35,29,62], 
but these are at best lightly validated formalisations, and there is no well- 
validated relaxed-memory concurrency semantics of virtual memory. In the ab- 
sence of that (and of proof techniques above it), previous OS and hypervisor 
verification work, e.g. on seL4, CertiKOS, KCore, Hyper-V, the PROSPER hypervisor,
and SeKVM [25,40,37,44,11,38,43,61], has had to make major simplifying
assumptions, either assuming correctness of TLB management and a
single-threaded setting (seL4), or assuming sequentially consistent concurrency with
one of those hand-written sequential semantics, or assuming an extended notion 
of data-race-freedom (we return to the related work in §7). 

We explore the design space for Armv8-A relaxed virtual memory semantics, 
to support future systems-software verification. We contribute: 


— A description of the current Arm architectural intent as we understand it, 
and a set of design questions and issues arising from its relaxed virtual 
memory semantics (§3). 

— A relaxed virtual memory test suite, comprising a set of hand-written
litmus tests which illustrate the aforementioned design questions and capture 
key use cases from pKVM, a production hypervisor under development by 
Google (§4). 

— An axiomatic-style concurrency model for relaxed virtual memory in 
Armv8 (§5), which to the best of our knowledge and ability captures the
architectural intent described in §3. We also define a weaker model, moti- 
vated by the properties pKVM relies on. 

— We prove that, for stable injective page-tables, the first model collapses to 
the previous Armv8-A user-mode concurrency model (§5). 

— We extend our Isla tool [15], enabling it to compute the allowed behaviours 
of virtual memory litmus tests with respect to arbitrary axiomatic models,
using the authoritative Arm ASL definition of the intra-instruction semantics
including pagetable walks (§6.1). 

— We develop a test harness that lets us run virtual-memory litmus tests bare- 
metal, albeit currently only for Stage 1 tests, and report results from running 
these on hardware (§6.2). 


Mainstream industrial architecture specifications evolve over many years, 
balancing hardware-implementation and systems-software concerns. Experience 
with “user” relaxed-memory concurrency has shown that the process of devel- 
oping rigorous semantics for arbitrary code provides a useful third input into 
this process, leading one to ask questions which help clarify the architectural 
intent. The architects, hardware designers, and system-software authors typi- 
cally have a deep understanding of the area, but there is usually not, a priori, a 
well-understood informal specification that just needs to be formalised; instead 
that needs to be iteratively and collaboratively developed. Our §3 is based on 
detailed discussion with the Arm Chief Architect (a co-author of this paper); 
on the current Arm prose documentation [13]; on discussion with the pKVM 
development team; and on our experimental testing. To the best of our knowl- 
edge, our models provide a reasonable basis for software development and for 
verification, but this paper is surely not the last word on the subject, and it 
does not give an authoritative definition of the Armv8-A architecture. The his- 
tory of relaxed-memory models shows that it typically takes multiple years, and 
gradual refinement of models, to converge on something reasonably stable for a 
production architecture or language, and even then they continue to change as 
new knowledge or features arise; with hindsight, few are definitive. Our goal here 
is rather to lay out some of the main issues, bringing this security-critical sys- 
tems code into the domain of programming-language semantics and verification, 
above foundational architecture semantics. 

We begin in §2 with an informal introduction to virtual memory in a simple 
sequential setting, to make this self-contained. This paper is necessarily con- 
densed; an extended version, with our tests, models, proofs, and Isla tooling, is 
available at https://www.cl.cam.ac.uk/users/pes20/RelaxedVM-Arm/. 


Scope and non-goals Our scope is Armv8-A virtual memory for the 64-bit 
(AArch64) architecture, aiming especially to support aspects relevant to hy- 
pervisors such as pKVM. Accordingly, we consider translation with multiple 
stages (for both hypervisor and OS), multiple levels, and the full Armv8-A intra- 
instruction semantics and translation walk behaviour (as defined by Arm in ASL 
and auto-translated to Sail [14]). Our models cover the Armv8-A ETS option as 
work in progress. We discuss some mixed-size aspects, but our models do not 
currently cover them. To keep things manageable, we do not consider hardware 
management of access flags or dirty bits, conflict aborts, FEAT_BBM, FEAT_CNP,
FEAT_XS, the interactions between virtual memory and instruction-fetch, or all
the relaxed behaviour of exceptions, and we handle only some of the many vari- 
eties of the TLBI instruction. We focus on the specification of the architecturally 
allowed envelope of functional behaviour, not on side-channel phenomena. We
include some experimental testing, as a sanity check of our models, but our principal
goal is to capture the architectural intent, and our principal validation is
from discussion with Arm. Many of the issues should also be relevant to other 
architectures, but here we address only Armv8-A. 


2 Background: A Crash Course on Virtual Memory 


2.1 Virtualising addressing 


In conventional computer systems, the underlying memory is indexed by physical 
addresses (PAs), as are memory-mapped devices. For a small microcontroller 
running trusted code, accessing resources directly via physical addresses may 
suffice. Larger systems rely heavily on virtual addressing: they interpose one or 
more layers of indirection between virtual addresses (VAs) used by instructions 
and the underlying physical addresses. This lets them: 


1. partition resources among different programs, giving each access only to
those it needs; 

2. provide convenient numeric ranges of virtual addresses to each program; and 

3. dynamically extend and change the mapping from virtual to physical ad- 
dresses, e.g. to support copy-on-write, swapping, or shared buffers. 


A simple system might have many processes managed by an operating system, 
each of which (including the OS) has a partial function that gives the physical 
address and permissions for the virtual addresses it can use, roughly: 


$\mathit{translate} : \mathit{VirtualAddress} \rightharpoonup \mathit{PhysicalAddress} \times \mathcal{P}(\{\mathit{Read}, \mathit{Write}, \mathit{Execute}\})$


Typically each process would have access to a subset of the physical addresses 
(the range of its translate function), disjoint from those of the other processes 
and from that of the OS, while the OS would have sole access to its own working 
memory and also access to that of the processes. This is implemented with a 
combination of hardware and system software. The hardware memory manage- 
ment unit (MMU) automatically translates virtual to physical addresses when 
doing an access needed to execute an instruction. If the function is undefined, 
the instruction traps with a page fault; if it is defined but does not have the ap- 
propriate accesses, it traps with a permission fault; and if it is defined with the 
right permissions, the hardware performs the required access using the resulting 
physical address. The OS has to set up the translate functions, ensure that the 
appropriate function is used when switching to a new process, and handle those 
faults. Translation functions are not necessarily injective, and the full translate 
function has permissions per exception-level, and includes not just access per- 
missions but additional fields for cacheability, shareability, security, contiguity, 
and others which we elide for simplicity here. 
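
The three outcomes of a translated access described above (page fault, permission fault, or a physical access) can be sketched as follows. This is a deliberately simplified Python model, not the Arm definition: the page-granular dictionary encoding, the fixed 4096-byte page size, and the names PageFault, PermissionFault, and try_access are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Tuple, Union

PAGE_SIZE = 4096  # assumed page size for this sketch


class PageFault(Exception):
    """The translate function is undefined for the address."""


class PermissionFault(Exception):
    """The translate function is defined, but the access is not permitted."""


# Hypothetical encoding of one process's partial translate function:
# virtual page number -> (physical page number, permissions).
Mapping = Dict[int, Tuple[int, FrozenSet[str]]]


def translate(mapping: Mapping, va: int, access: str) -> int:
    """Return the physical address for an access to va, or raise a fault."""
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in mapping:
        raise PageFault(hex(va))
    ppn, perms = mapping[vpn]
    if access not in perms:
        raise PermissionFault(hex(va))
    return ppn * PAGE_SIZE + offset


def try_access(mapping: Mapping, va: int, access: str) -> Union[int, str]:
    """Like translate, but report a fault by name instead of raising."""
    try:
        return translate(mapping, va, access)
    except (PageFault, PermissionFault) as fault:
        return type(fault).__name__


process_map: Mapping = {
    0x10: (0x80, frozenset({"Read", "Write"})),
    0x11: (0x81, frozenset({"Read", "Execute"})),
}
```

With this mapping, a write to 0x10008 is performed at physical address 0x80008, a write to 0x11004 reports a permission fault, and a read of 0x12000 reports a page fault.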


2.2 The translation-table walk


The current translate function for execution is determined by a system register, 
a translation table base register or TTBR, that contains the physical address of 
a lookup-tree data structure in memory. The details of this structure are (in 
Armv8-A) highly configurable, e.g. for different page sizes, controlled by various 
system registers. In a common configuration used by Linux, it maps 4096-byte 
pages and has a tree up to four levels (0-3) deep. Each non-leaf node of the tree 
has 512 64-bit entries, indexed by specific bit ranges of the virtual address. Each 
entry can be either invalid, meaning that the translate function is undefined for 
this part of the domain; a block (at levels 1 or 2) or page descriptor entry (at 
level 3), returning an output address and permissions; or a table (at levels 0, 1, 
or 2), with the physical (or intermediate physical) address of a next-level table 
with which to continue recursively. 

This translation-table walk function is fully defined in the Arm ASL language. 
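
As an informal illustration of the walk just described (and in no way a substitute for the ASL definition), the following Python sketch models the common configuration with 4096-byte pages and four levels, each indexed by 9 bits of the virtual address. Tables are represented as in-memory Python objects rather than as memory read through physical addresses, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """A block (levels 1 or 2) or page (level 3) descriptor."""
    out_addr: int
    perms: FrozenSet[str]


@dataclass
class Table:
    """A table with up to 512 entries; a missing index stands for an invalid entry."""
    entries: Dict[int, Union["Leaf", "Table"]] = field(default_factory=dict)


def level_shift(level: int) -> int:
    """Lowest VA bit used to index a table at this level (39, 30, 21, 12)."""
    return 12 + 9 * (3 - level)


def walk(root: Table, va: int) -> Optional[Tuple[int, FrozenSet[str]]]:
    """Return (output address, permissions), or None if translate is undefined."""
    table = root
    for level in range(4):
        shift = level_shift(level)
        entry = table.entries.get((va >> shift) & 0x1FF)  # 9-bit index
        if entry is None:
            return None                      # invalid entry
        if isinstance(entry, Leaf):
            if level == 0:
                return None                  # no block descriptors at level 0
            offset = va & ((1 << shift) - 1)  # bits below the mapped region
            return entry.out_addr + offset, entry.perms
        if level == 3:
            return None                      # table descriptors end at level 2
        table = entry
    return None  # not reached: level 3 always returns above


RW = frozenset({"Read", "Write"})
level3 = Table({3: Leaf(0x8000_0000, RW)})                  # a 4 KiB page
level2 = Table({2: level3, 5: Leaf(0x9000_0000, RW)})       # index 5: a 2 MiB block
level1 = Table({1: level2})
root = Table({0: level1})
```

Here the address 0x40403234 (indices 0, 1, 2, 3 and offset 0x234) resolves through all four levels, whereas 0x40A12345 stops at the level 2 block and keeps a 21-bit offset.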


2.3 Multiple stages of translation 


The above suffices for an operating system isolating multiple processes from 
each other, but one often wants to isolate multiple operating systems (or other 
guests), managed by a hypervisor. To support this, the architecture provides 
a second layer of indirection: instead of going straight from virtual to physical 
addresses, with a single stage of mapping controlled by the OS, one can have two 
stages, with the OS managing a Stage 1 table which maps virtual addresses to
intermediate physical addresses (IPAs), composed with a hypervisor-managed
Stage 2 table, mapping IPAs to PAs. The full translation composes the two, 
intersecting their permissions. 


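$\mathit{translate\_stage1} : \mathit{VirtualAddress} \rightharpoonup \mathit{IPA} \times \mathcal{P}(\{\mathit{Read}, \mathit{Write}, \mathit{Execute}\})$
$\mathit{translate\_stage2} : \mathit{IPA} \rightharpoonup \mathit{PhysicalAddress} \times \mathcal{P}(\{\mathit{Read}, \mathit{Write}, \mathit{Execute}\})$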
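
The composition of the two stages, with permissions intersected, can be sketched as below. This is an illustrative Python model under simplifying assumptions (page-granular dictionaries, a single 4096-byte page size, partiality modelled by returning None); from_pages and compose are hypothetical helper names.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

PAGE_SIZE = 4096  # assumed page size for this sketch
Perms = FrozenSet[str]
Stage = Callable[[int], Optional[Tuple[int, Perms]]]


def from_pages(pages: Dict[int, Tuple[int, Perms]]) -> Stage:
    """Build a partial translation function from a page-number mapping."""
    def stage(addr: int) -> Optional[Tuple[int, Perms]]:
        page, offset = divmod(addr, PAGE_SIZE)
        if page not in pages:
            return None                      # undefined at this stage
        out_page, perms = pages[page]
        return out_page * PAGE_SIZE + offset, perms
    return stage


def compose(stage1: Stage, stage2: Stage) -> Stage:
    """Full translation: VA -> IPA -> PA, intersecting the permissions."""
    def translate(va: int) -> Optional[Tuple[int, Perms]]:
        s1 = stage1(va)
        if s1 is None:
            return None                      # stage 1 undefined
        ipa, perms1 = s1
        s2 = stage2(ipa)
        if s2 is None:
            return None                      # stage 2 undefined
        pa, perms2 = s2
        return pa, perms1 & perms2
    return translate


# OS-managed stage 1 and hypervisor-managed stage 2 (made-up mappings).
stage1 = from_pages({0x1: (0x5, frozenset({"Read", "Write"})),
                     0x3: (0x7, frozenset({"Read"}))})
stage2 = from_pages({0x5: (0x9, frozenset({"Read", "Execute"}))})
translate = compose(stage1, stage2)
```

In this example the guest grants read and write access to page 0x1 while the hypervisor grants read and execute access to the backing page, so only read access survives the intersection.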


Armv8-A has various exception levels (ELs), including EL0 (for user processes),
EL1 (for OSs or other guests), and EL2 (for a hypervisor). These each have 
associated translation-table base registers: 


— TTBR0_EL1: contains a pointer (IPA) to the Stage 1 table for EL1&0, lower
VA range (process addresses), producing IPAs, controlled by OS at EL1 

— TTBR1_EL1: contains a pointer (IPA) to the Stage 1 table for EL1&0, upper 
VA range (OS kernel addresses), producing IPAs, controlled by OS at EL1 

— VTTBR_EL2: contains a pointer (PA) to the Stage 2 table (second stage for 
IPAs translated at EL1&0), producing PAs, controlled by hypervisor at EL2 

— TTBR0_EL2: contains a pointer (PA) to the single-stage table for EL2 (hypervisor’s
own addresses), producing PAs, controlled by hypervisor at EL2


Each hardware thread has its own base registers (and other system registers), 
and so different hardware threads can be using different address spaces (for 
example, for different processes) at the same time.


2.4 Caching translations in TLBs


A naive hardware implementation of address translation would need many trans- 
lation memory reads — with four levels, up to 24 with both stages enabled, 
for every instruction-fetch, read, or write. This would have unacceptable per- 
formance, so processors have specialised caches for translation-table walk reads 
called translation lookaside buffers (or TLBs). Under normal operation the TLBs 
are invisible to user code, but systems code has to manage them explicitly, to 
change which translation table is currently in use (e.g. when context switching), 
or to make changes to the tables for one process or guest. Without correct man- 
agement a TLB could hold incorrect (stale) data, breaking the protection that 
the address translation is intended to provide. 

The architecture supports explicit TLB maintenance with various flavours of 
the TLBI instruction (TLB invalidate), to invalidate old entries for specific ranges 
of virtual or intermediate physical addresses, or even whole ASIDs or VMIDs at 
once. The memory management unit (MMU) is responsible for performing these 
translations. It does this by looking at the TLB and, if the TLB does not contain 
an entry for the given address (called a miss), it performs the translation table 
walk function as described earlier and caches the result in the TLB (a fill). 

TLB maintenance and TLB misses are expensive, and one would not want 
the cost of TLB invalidation on every context switch, so the architecture provides 
address space identifiers (ASIDs). The translation table base registers include 
an ASID in addition to the table base address, and when translation data is 
cached in a TLB it is tagged with the current ASID, giving the illusion of sepa- 
rate TLBs per ASID, and allowing switching from one to another without TLB 
maintenance. Eventually the system will need to reclaim and reuse a previously 
used ASID, and then TLB maintenance is required to clean that ASID’s old 
entries. There are similar identifiers for Stage 2 intermediate physical memory, 
known as virtual-machine identifiers or VMIDs. 
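
The effect of ASID tagging can be illustrated with a toy model. This is a sketch under simplifying assumptions: the class `TaggedTLB`, its `walk` callback, and the example tables are hypothetical, the ASID is used directly to select a table (standing in for the base register), and the TLB deterministically prefers a cached entry, whereas real hardware may evict or refill entries at any time.

```python
from typing import Callable, Dict, Optional, Tuple


class TaggedTLB:
    """Toy TLB whose entries are tagged with an ASID."""

    def __init__(self, walk: Callable[[int, int], Optional[int]]):
        self.walk = walk  # models the translation-table walk
        self.entries: Dict[Tuple[int, int], int] = {}
        self.misses = 0

    def translate(self, asid: int, va: int) -> Optional[int]:
        key = (asid, va)
        if key in self.entries:
            return self.entries[key]  # hit: no memory reads needed
        self.misses += 1
        pa = self.walk(asid, va)  # miss: do the walk
        if pa is not None:  # faulting walks are not cached
            self.entries[key] = pa  # fill
        return pa

    def tlbi_asid(self, asid: int) -> None:
        """Invalidate every entry tagged with the given ASID."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}


# Two hypothetical processes mapping the same VA to different PAs.
tables = {1: {0x1000: 0xA000}, 2: {0x1000: 0xB000}}
tlb = TaggedTLB(lambda asid, va: tables[asid].get(va))

first = tlb.translate(1, 0x1000)   # miss, fill for ASID 1
other = tlb.translate(2, 0x1000)   # miss, fill for ASID 2
again = tlb.translate(1, 0x1000)   # hit, no maintenance needed on the switch

tables[1][0x1000] = 0xC000         # table changed without TLB maintenance
stale = tlb.translate(1, 0x1000)   # still the old cached value
tlb.tlbi_asid(1)
fresh = tlb.translate(1, 0x1000)   # re-walked after invalidation
```

Switching between the two ASIDs needs no invalidation, but changing a table (or reusing an ASID) leaves a stale entry until the TLBI.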


3 Concurrency Architecture Design Questions 


Now we will introduce the main concurrency architecture design questions that 
arise for Armv8-A virtual memory, within the scope laid out in the introduction. 
As usual, the architecture has to define an envelope of behaviour that provides 
the guarantees needed by software, while admitting the relaxed behaviour of the 
microarchitectural techniques necessary for performance. That means we have to 
discuss both, including just enough microarchitecture to understand the possible 
programmer-visible behaviour, before we abstract it in the semantic models we 
give in §5. The discussion includes points of several kinds: some that are clear in 
the current Arm documentation, some where Arm have a change in flight, some 
that are not documented but where the semantics is (after discussion) obviously 
constrained by existing hardware or software practice, and some where there is a 
tentative Arm intent but it is not yet fixed upon; our modelling raised a number 
of questions of the latter two. To make this as coherent as possible, we discuss 
all these in a logical order, laying out the design principles. We have developed a 
suite of 214 hand-written Isla-compatible virtual-memory litmus tests 
that illustrate the issues, but to keep this concise we just give the main ideas 
here. In the extended version, we link to tests for each issue. As a sample, we 
explain one pKVM test in detail in §4. 


3.1 Coherence with respect to physical or virtual addresses 


For normal memory accesses, the most fundamental guarantee that architectures 
provide is coherence: in any execution, for each memory location, there is a to- 
tal order of the accesses to that location, consistent with the program order of 
each thread, with reads reading from the most recent write in that order. Hard- 
ware implementations provide this, despite their elaborate cache hierarchies and 
out-of-order pipelines, by coherent cache protocols and pipeline hazard check- 
ing, identifying and restarting instructions when possible coherence violations 
are detected. Previous work on relaxed-memory semantics for architectures has 
taken virtual addresses as primitive, implicitly considering only execution with 
well-formed, constant, and injective address translation mappings. 

Now, we have to consider whether coherence is with respect to virtual or phys- 
ical addresses, for non-injective mappings. For Arm, coherence is w.r.t. physical 
addresses [13, D5.11.1 (p2812)]. This means that if two virtual addresses alias 
to the same physical address, then (still assuming well-formed and constant 
translation): a load from one virtual address cannot ignore a program-order (po) 
previous store to the other; and a load from one virtual address can have its 
value forwarded from a store to the other, and similarly on a speculative branch. 


3.2 Relaxed behaviour from TLB caching 


There are two main aspects of the concurrency semantics of virtual memory: the 
relaxed behaviour arising directly from TLB caching, and the relaxed behaviour 
of the not-from-TLB (non-TLB) memory accesses for translation reads that 
read from memory or by forwarding from po-previous writes, and that might 
supply TLB cache fills. We discuss them in this and the following subsection 
respectively. 


What can be cached: The MMU can cache information from successful trans- 
lations, and also from translations that result in permission faults, but it is archi- 
tecturally forbidden from caching information from attempted translations that 
result in translation faults. This ensures that the handlers of those faults do not 
need to do TLB maintenance to remove the faulting entry [13, D5.8.1 (p2780)], 
and makes the potential behaviour for page-table updates from invalid-to-valid 
and valid-to-any quite different, as we shall see. 

TLB implementations might cache any combination of individual page-table 
entries and partial or complete translations, e.g. from the virtual address and 
context to the physical address of the last-level page. Conceptually, however, we 
can simply view a TLB as containing a set of cached page-table-entry writes 
(i.e., writes that have been read from for a translation), including at least: 


— the context information of the translation: the VMID, ASID, and the origi- 
nating exception level; 

— the virtual address, intermediate physical address, and/or physical address 
of the translation; 

— the translation stage and level at which the write was used; 

— the system register values used in the translation (those which can be 
cached); and 

— for an entry used for a Stage 1 translation, whether it has been invalidated 
at both stages. 


That additional information allows the various TLBI instructions to target spe- 
cific entries. A translation walk can arbitrarily use either a cached write (if one 
exists) or do a non-TLB read, either from memory or by forwarding from a 
po-previous write, for any stage or level. 
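
The conceptual contents of such a TLB entry can be written down as a record. This is only an illustrative sketch: the class `CachedPteWrite`, its field names, and the helper `tlbi_by_asid` are hypothetical, and the helper models just one simple flavour of invalidation (all Stage 1 entries for an ASID).

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class CachedPteWrite:
    descriptor: int                  # value of the page-table-entry write read from
    vmid: Optional[int]              # context: VMID
    asid: Optional[int]              # context: ASID
    exception_level: int             # context: originating exception level
    va: Optional[int]                # addresses of the translation, where relevant
    ipa: Optional[int]
    pa: Optional[int]
    stage: int                       # translation stage at which the write was used
    level: int                       # table level at which the write was used
    sysregs: Tuple[Tuple[str, int], ...] = ()   # cacheable system-register values
    invalidated_at_both_stages: bool = False    # only meaningful for Stage 1 entries


def tlbi_by_asid(tlb: Set[CachedPteWrite], asid: int) -> Set[CachedPteWrite]:
    """Drop the Stage 1 entries tagged with the given ASID; keep everything else."""
    return {e for e in tlb if not (e.stage == 1 and e.asid == asid)}


e1 = CachedPteWrite(descriptor=0xA003, vmid=0, asid=1, exception_level=1,
                    va=0x1000, ipa=0x8000, pa=None, stage=1, level=3)
e2 = CachedPteWrite(descriptor=0xB003, vmid=0, asid=2, exception_level=1,
                    va=0x1000, ipa=0x9000, pa=None, stage=1, level=3)
remaining = tlbi_by_asid({e1, e2}, asid=1)
```

Keeping the context and stage in each entry is what lets an invalidation pick out exactly the entries it is meant to affect.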


Caching of multiple entries for the same virtual address and con- 
text: High-performance hardware implementations may have elaborate TLB 
structures, including multiple “micro TLBs” per thread. These can be seen as a 
conceptual single per-thread TLB that can hold zero, one, or more entries for 
each combination of input address and the other information above. If zero, a 
translation will necessarily read from memory (with ordering constrained as dis- 
cussed below). If one or more, a translation may use any of those entries or read 
from memory (and the write read from might or might not be cached). However, 
in some cases multiple entries constitute a break-before-make failure, leading to 
relatively unconstrained behaviour; we return to this below. 


When can page-table entries be cached: Any memory read by a translation 
can be cached. Any thread can spontaneously do a translation for any virtual ad- 
dress at any program point, with respect to its context at that point (though this 
interacts with the system-register write/read semantics). Spontaneous transla- 
tions model hardware prefetching, speculative execution, and branch prediction. 
They mean that, in the absence of cache maintenance, translations may use TLB 
entries from arbitrarily old writes. Additionally, any thread may do a sponta- 
neous translation at any point using the configuration from any exception level 
higher than the current one, but not for lower levels. Preventing spontaneous 
walks at lower EL is essential, as during an EL2 hypervisor switch between 
VMs, the EL1 control registers will be in an inconsistent state. Allowing spon- 
taneous walks at higher EL models arbitrary interrupts to the higher level and 
then doing a spontaneous walk there. 

Each virtual-memory access by a thread involves a non-spontaneous transla- 
tion which is constrained by the normal inter-instruction constraints on out-of- 
order and speculative execution by the thread. These constraints are especially 
important in order to understand when a translation must fault: as invalid en- 
tries cannot be cached, a translation that gives rise to such a fault must be at 
least in part from a non-TLB read, subject to these ordering constraints. 


Coherence of translations: Due to the TLB caching as described above, trans- 
lations of the same virtual address by the same thread need not see a coherent 
view of page-table memory. This is in sharp contrast to normal accesses, but 
analogous to instruction-fetch reads [56] and reads from persistent memory [51]. 


Removing cached entries: TLBs may spontaneously forget any cached infor- 
mation at any point. To ensure that a cached entry is removed, software must 
ensure that it will not be spontaneously re-cached. It can do this with a write of 
an invalid entry and then a DSB instruction (data synchronization barrier) to 
ensure that it is visible across the system, followed by a TLBI. 


Break-before-make failures: When changing an existing translation map- 
ping, from one valid entry to another valid entry, Arm require in many cases the 
use of a break-before-make (BBM) sequence: breaking the old mapping with a 
write of an invalid entry; a DSB to ensure that is visible across the system; and a 
broadcast TLBI to invalidate any cached entries for all relevant threads; a DSB 
to wait for the TLBI to finish; then making the new mapping with a write of the 
new entry, and additional synchronisation to ensure that it is visible to translations. 
The current Arm text [13, D5.10.1 (p2795)] identifies six cases of page-table 
updates that without such a sequence constitute BBM failures, and gives 
very severe architectural consequences thereof: failures of coherency, single-copy 
atomicity, ordering, or uniprocessor semantics. Note that these consequences are 
architecturally allowed if there could exist a break-before-make-failure change 
to the translation tables for some virtual address, irrespective of whether the 
program architecturally accesses it. 

This severity is because, in some of the six cases, hardware implementations 
could give rather arbitrary behaviour, e.g. an amalgamation of old and new 
entries. From a software point of view, it seems that one must treat such cases 
more-or-less as fatal errors. This is analogous to the Data-race-free-or-catch- 
fire semantics underlying the C/C++ relaxed memory model [4,33,22,20], in 
which any program with a consistent execution that includes a race between 
nonatomic accesses is deemed to have undefined behaviour, and the C/C++ 
standards do not constrain implementation behaviour for such programs in any 
way. This makes many potential litmus tests that change between valid entries 
uninteresting, as they simply exhibit BBM failures. 

However, for a processor architecture that supports virtualisation, one cannot 
regard BBM failures as allowing completely arbitrary behaviour for the entire 
machine: if one guest virtual machine (at EL1) changes one of its own translation 
mappings without correctly following the BBM sequence, either mistakenly or 
maliciously, that should not impact security of the hypervisor (at EL2) or other 
guests. Instead, one has to bound the arbitrary behaviour to that virtual ma- 
chine, allowing arbitrary memory and register accesses that are possible within 
its context. In our exhaustively executable semantics, to keep litmus-test execu- 
tions finite, we currently simply detect BBM failures; we do not explicitly model 
that arbitrary behaviour. 

In reality, these six BBM failure cases include some where hardware may 
give such weakly constrained behaviour and others where, because coherence 
is over physical addresses and the mapping may be temporarily indeterminate, 
software might see well-defined but nondeterministic or surprising results. These 
were architected as a guide for system software to produce predictable behaviour, 
and future versions of the architecture might refine this. 

When a hypervisor installs a new guest, it has to be able to reset to a clean 
state. It can do so with a TLBI covering all the previous guest’s processes’ address 
spaces. There seems to be no need or support for finer-grain cleanup. 


3.3 Relaxed behaviour of translation-walk non-TLB reads 


Now we turn to the semantics of translation-walk non-TLB reads, those that are 
satisfied from memory or by forwarding, not from a TLB. This matters especially 
when one knows that there are no relevant cached TLB entries, e.g. when an 
invalid entry has been written and a TLBI performed. 


Ordering among the translation-walk reads of an access: Each 
translation-table walk for a virtual-memory access can involve many memory 
reads, one for each level of the table for each stage of translation. 

In the example walk shown in the diagram, each Tn is a read of level n of the 
Stage 1 table. Each of those Stage 1 reads must first be translated to get the 
PA (as the table contains IPAs) and so each Tnk is a read of level k of the 
Stage 2 table for the address of the Stage 1 table at level n. Once the full 
Stage 1 walk has been completed the final output IPA must be translated to the 
final PA, and those are the final 4 T_n reads, of the Stage 2 table at level n. 
The reads are ordered one after another in the order they appear in the ASL 
walk function. This ordering must be respected by hardware as software relies 
on it when building the tables bottom-up. 

[Figure: example two-stage walk, showing the Stage 2 reads Tnk for each Stage 1 read Tn, followed by the final Stage 2 reads T_n.] 
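
The structure of the walk also accounts for the figure of 24 reads with four levels at both stages. The following sketch enumerates the reads in the order described above; the function name `walk_order` and the string labels are hypothetical, chosen to match the Tn, Tnk, and T_n notation.

```python
from typing import List


def walk_order(s1_levels: int = 4, s2_levels: int = 4) -> List[str]:
    """List the translation-table reads of one two-stage walk, in walk order."""
    reads: List[str] = []
    for n in range(1, s1_levels + 1):
        # The Stage 1 table address at level n is an IPA: translate it first.
        reads += [f"T{n}{k}" for k in range(1, s2_levels + 1)]
        reads.append(f"T{n}")  # then read the Stage 1 entry itself
    # Finally translate the output IPA of Stage 1 through Stage 2.
    reads += [f"T_{n}" for n in range(1, s2_levels + 1)]
    return reads


order = walk_order()
total = len(order)  # 4*4 Stage 2 reads + 4 Stage 1 reads + 4 final Stage 2 reads
```

In general the count is s1_levels × s2_levels + s1_levels + s2_levels, which for four levels at each stage is 16 + 4 + 4 = 24.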


Dependencies into translation-walk non-TLB reads: Address dependen- 
cies into a memory-access instruction in classic “user” models are now explainable 
as dataflow dependencies to the translation reads of those accesses, as the address 
has to be available before a walk can start. These are virtual-address dataflow 
dependencies (contrasting with physical-address coherence). 


Translation-walk non-TLB reads from non-speculative same-thread 
writes: 


PO-past A translation-walk non-TLB read might read from a po-previous page- 
table-entry write, but it is only guaranteed to see such a write if there is enough 
intervening synchronisation. Arm have recently introduced Enhanced Transla- 
tion Synchronization (ETS), optional in Armv8.0 and mandatory from Armv8.7. 
Armv8-A implementations without ETS require both a DSB, to make the write 
visible to translation-walk non-TLB reads, and an ISB, to ensure that any 
translations for later instructions that were done out-of-order, before the write, are 
restarted. With ETS, only the DSB is required for a translation-walk non-TLB 
read to definitely see the write, though one might still need an ISB if the 
new translation enables new instruction fetch. Because invalid entries cannot 
be cached, this means that if an entry is initially invalid, then after a write of a 
valid entry and a DSB;ISB/DSB, translations will use that valid entry. However, 
the DSB;ISB/DSB does not remove cached entries, so an initially valid entry 
might be cached by a spontaneous walk, so even after a write (of an invalid or 
non-BBM-failure valid entry) and a DSB;ISB/DSB, the old entry could still be 
used by translations. One would need a TLBI sequence to remove old cached 
entries, which we return to below. 


PO-future The Armv8-A architecture allows load-store reordering, but it does 
not allow writes to become visible to other threads while they are still specula- 
tive. In the same vein, translation-walk non-TLB reads cannot read from po-later 
page-table-entry writes [13, D5.2.5 (p2683)]. Before the po-earlier translation is 
complete, one cannot know that it is not going to fault, so the later write has to 
be considered speculative. This prevents a thread-local self-satisfying translation 
cycle, analogous to the prevention of load-store cycles with dependencies. 


PO-present On the margin, can a translation-walk non-TLB read for a write 
access see that write, or a distinct write from the same instruction? The second 
case could arise from a store-pair or misaligned store that does two writes, with 
one to a page-table-entry that could be used by the other, though real code 
would typically not do this intentionally. This is explicitly allowed by the cur- 
rent architecture text [13, D5.2.5 (p2683)]. However that text does not specify 
whether the translations for those two writes could both read from the other, a 
self-satisfying translation cycle where the writes write each other’s translations. 
In general such self-satisfying cycles give rise to thin air behaviours and the 
architectural intent is to forbid them. 


Translation-walk non-TLB reads from speculative same-thread writes: 
Speculative execution requires translation walks, which might result in addi- 
tional page-table entries being cached, but in most cases this is indistinguishable 
from the effects of a non-speculative spontaneous walk. However, one has to ask 
whether a translation-walk non-TLB read can see a po-previous write that is 
still speculative, e.g. while both instructions follow an as-yet-unresolved condi- 
tional branch. It is clear that the result of such a walk should not be persistently 
cached, or made visible to other threads (via a shared TLB), while it remains 
speculative. Moreover, such translations could lead to arbitrary reads of read- 
sensitive device locations, which one normally relies on the MMU to prevent. 
The conclusion is therefore that this must be forbidden. 


Translation-walk non-TLB reads from same-thread writes, forbidden 
past (same-thread TLBI completion): To remove an existing mapping on a 
single thread, one needs first to write an invalid entry, then a DSB to ensure that 
has reached memory and thus is visible to translation-walk non-TLB reads (to 
prevent spontaneous re-caching), then a TLBI to invalidate any cached entries, 
then a DSB to wait for TLBI completion. Without ETS, one also needs an ISB 
to ensure that po-later translations that have been done early are restarted. 
With ETS, the ISB is not always necessary, though might still be needed for its 
instruction-cache effects if the change of mapping affects instruction fetch. After 
all that, an attempted access by that thread is guaranteed to fault. 
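
The same-thread unmapping sequence just described can be summarised as a small function. This is a sketch only: `unmap_sequence` and the step strings are hypothetical descriptive labels rather than assembler syntax, and the choice of TLBI variant is left open.

```python
from typing import List


def unmap_sequence(has_ets: bool, affects_instruction_fetch: bool = False) -> List[str]:
    """Steps for one thread to remove a mapping so that its later accesses fault."""
    steps = [
        "write invalid entry to the page table",
        "DSB  (write visible to translation-walk non-TLB reads)",
        "TLBI (invalidate cached entries)",
        "DSB  (wait for TLBI completion)",
    ]
    # Without ETS, po-later translations done early must be restarted.
    # With ETS, an ISB may still be needed if instruction fetch is affected.
    if not has_ets or affects_instruction_fetch:
        steps.append("ISB")
    return steps


without_ets = unmap_sequence(has_ets=False)
with_ets = unmap_sequence(has_ets=True)
with_ets_ifetch = unmap_sequence(has_ets=True, affects_instruction_fetch=True)
```

The first DSB must precede the TLBI: otherwise a spontaneous walk could re-cache the old entry after the invalidation.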


Translation-walk non-TLB reads from other-thread writes, guaran- 
teed past, initially invalid: Now consider when a translation-walk non-TLB 
read is guaranteed to see a write by another thread of a new entry, assuming 
that the entry was previously invalid and any cached entries for it invalidated. 
Consider a two-thread message-passing case, where a producer P0 writes a new 
valid page table entry (pte_valid), then has some ordering before a write of a 
flag, while a consumer P1 reads the flag, then has some ordering before an 
access Rx or Wx that needs that entry for a translation Tx of virtual address x. 

P0                       | P1 
a: W pte(x)=pte_valid    | c: R flag=1 
<Producer ordering>      | <Receiver ordering> 
b: W flag=1              | d: Tx (for the access Rx or Wx) 

On some Armv8-A implementations that do not support ETS, some “obvious” 
combinations of ordering on P0 and P1 could lead to an abort of the 
translation of (d), which some OS software would find difficult to handle. This 
was the main motivation for ETS: implementations without it can have weak be- 
haviour, requiring strong synchronisation to prevent the abort, while with ETS 
the architecture is stronger, requiring only weaker ordering to prevent the abort. 

Without ETS, two combinations of ordering are architected as sufficient to 
ensure that the translation (d) sees the new valid entry: 


1. P0 has any ordered-before relationship, and P1 has DSB+ISB. 
2. P0 has DSB; TLBI; DSB, and P1 has any ordered-before relationship. 


In Case 1, the message-passing is enough to ensure the write (a) is in main 
memory, the P1 ISB ensures that any out-of-order translation of (d) is restarted, 
and the P1 DSB keeps the read (c) and that ISB in order. In Case 2, the first DSB 
ensures the write is visible to all threads, the TLBI (broadcast, for the virtual 
address x) invalidates any older cached entry on P1, and the second DSB waits 
for that TLBI to be complete, after which any new translation on P1 will have to 
see the new entry. However, it appears that the probability of an unhandleable 
abort in practice, where one usually does not have these operations immediately 
adjacent, and where in many cases the abort could be handled, has been judged 
low enough that OS code is not necessarily using either of these. 

With ETS, the architecture says [13, D5.2.5 (p2683)] that “if a memory access 
RW1 is Ordered-before a second memory access RW2, then RW1 is also Ordered- 
before any translation table walk generated by RW2 that generates a Translation 
fault, Address size fault, or Access flag fault.” Microarchitecturally, the intuition 
here is that with ETS any translation done while speculative that leads to such 
a fault will have to be reconfirmed as faulting when execution is no longer 
speculative, so an early faulting translation of (d) would have to be restarted after 
the ordered-before edges have ensured that (a) is visible. However, in the case 
that the RW2 instruction faults, there is no read or write event, and if the fault 
is a translation fault, there is no physical address. One therefore has to ask what 
the meaning of ordered-before edges into RW2 is, especially for the parts of 
ordered-before dependent on physical addresses, such as coherence. The conclu- 
sion is that this should be only the non-physical-address parts of ordered-before 
into RW2, and in modelling one needs a “ghost” event to properly record what 
the dependencies would have been if it had succeeded. Note that this includes 
ordered-before to RW2 that ends with a data dependency into a write, even 
though that data would not normally be necessary for the translation. 

Even with ETS, one might need an ISB on P1 if the new translation affects 
instruction fetch. 


Translation-walk non-TLB reads from other-thread writes, guaranteed 
past, initially valid (other-thread TLBI completion): The following test 
has a read-only mapping for some physical address that is updated with a new 
writeable mapping to the same physical address, followed by a message-pass to 
another thread that attempts to write. There is no requirement for 
break-before-make here, as the output address has not changed, but TLB 
maintenance is required to ensure that the new writeable entry is guaranteed 
to be used by later translation reads. 

P0                            | P1 
STR pte_writeable, [pte(x)]   | LDR X0, [y] 
DSB SY                        | DMB SY 
TLBI ..., [page(x)]           | MOV ... 
DSB SY                        | STR ..., [x] 
STR ..., [y]                  | 
Forbid: 1:X0=1 & permission_fault(..., x) 

Arm forbid the outcome where the STR faults due to a permission check. This 
is because the TLBI only completes once all instructions using any old translations 
which would be invalidated by the TLBI, on all other threads that the TLBI 
affects, have also completed, and the following DSB waits for that (the same- 
thread case is different; see §3.3). In practice this means that once the TLBI 
completes, one of the following holds: either the final STR has not performed its 
translation of x yet and will be required to see the writeable mapping for its page 
table entry (pte); or the STR has translated using the new writeable mapping; or 
the STR has already translated using the old read-only mapping, in which case we 
know that the STR has finished and performed its write, since the TLBI could not 
complete while it was still in-progress. In that case if the STR has completed, then 
so must have the locally-ordered-before LDR, and that must have read 0. This 
explanation also covers the make-after-break case above, for non-ETS Case 2. 

This is reflected in text to be included in future versions of the Arm ARM: 
A TLB maintenance operation [without nXS] generated by a TLB maintenance 
instruction is finished for a PE when: 


1. all memory accesses generated by that PE using in-scope old translation in- 
formation are complete. 
2. all memory accesses RWx generated by that PE are complete. RWx is the set 
of all memory accesses generated by instructions for that PE that appear in 
program order before an instruction (I1) executed by that PE where: 

(a) I1 uses the in-scope old translation information, and 

(b) the use of the in-scope old translation information generates a syn- 
chronous data abort, and 

(c) if I1 did not generate an abort from use of the in-scope old translation 
information, I1 would generate a memory access that RWx would be 
locally-ordered-before. 


Translation-walk reads from same- and other-thread writes, forbidden 
past (break-before-make): Now we can finally return to the break-before- 
make sequence. Normal reads cannot read from the coherence-predecessors of 
the most coherence-recent write that is visible to them, but translation reads 
can read old (non-invalid) values from a TLB. To prevent this, and to ensure 
that a translation read sees a new page-table entry, one has to both ensure that 
any old TLB entries are invalidated, with a suitable TLBI, and that the new 
entry is visible to translation-walk non-TLB reads. 

Armv8-A says [13, D5.10.1 (p2795)]: “A break-before-make sequence on changing 
from an old translation table entry to a new translation table entry requires 
the following steps: (1) Replace the old translation table entry with an invalid 
entry, and execute a DSB instruction. (2) Invalidate the translation table entry 
with a broadcast TLB invalidation instruction, and execute a DSB instruction 
to ensure the completion of that invalidation. (3) Write the new translation table 
entry, and execute a DSB instruction to ensure that the new entry is visible.”. 
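
The purpose of these steps can be seen in a deliberately simple toy model. This is a sketch under strong assumptions: `ToySystem` is hypothetical, it has a single page-table entry, execution is sequentially consistent (so the DSBs are implicit), and a translation may use either any cached entry or the current memory value.

```python
from typing import Optional, Set

INVALID = None  # stands for an invalid page-table entry


class ToySystem:
    def __init__(self, initial_pte: Optional[int]):
        self.memory_pte = initial_pte
        self.tlb: Set[int] = set()

    def spontaneous_walk(self) -> None:
        # Any valid entry read by a walk may be cached; invalid ones may not.
        if self.memory_pte is not INVALID:
            self.tlb.add(self.memory_pte)

    def write_pte(self, value: Optional[int]) -> None:
        self.memory_pte = value

    def tlbi(self) -> None:
        self.tlb.clear()

    def possible_translations(self) -> Set[Optional[int]]:
        # A walk may use a cached entry or do a non-TLB read of memory.
        return set(self.tlb) | {self.memory_pte}


OLD, NEW = 0xA000, 0xB000

# Direct valid-to-valid update: old and new entries are both usable.
unsafe = ToySystem(OLD)
unsafe.spontaneous_walk()
unsafe.write_pte(NEW)
unsafe_results = unsafe.possible_translations()

# Break-before-make: break, invalidate, then make.
safe = ToySystem(OLD)
safe.spontaneous_walk()
safe.write_pte(INVALID)   # step 1: break (followed by a DSB on hardware)
safe.tlbi()               # step 2: invalidate (followed by a DSB on hardware)
after_break = safe.possible_translations()
safe.write_pte(NEW)       # step 3: make (followed by a DSB on hardware)
after_make = safe.possible_translations()
```

In the direct update a translation can still use the old entry, whereas after the break and invalidation only a fault is possible, and after the make only the new entry.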

Typically the write of an invalid entry and TLBI would be on the 
same thread, but more generally, any shape as below should be forbidden, 
where Tx is a translation-walk read for an access of x and the trf relation 
shows the page-table write it reads from. In other words, the sequence 
ensures that the write of the invalid entry, and of any co-predecessor 
writes, are hidden behind the new page-table entry as far as new 
translations are concerned. 

[Figure: forbidden shape. P0: W pte(x)=invalid; DSB. P1: TLBI; DSB. P2: W pte(x)=desc(x); DSB; ISB (if...); Tx faults. ob edges run from P0 to P1 and from P1 to P2; a trf edge runs from the write of the invalid entry to the faulting Tx.] 

Here the P0 DSB and P0-to-P1 ob ensure the P0 write has propagated to memory 
before the P1 TLBI starts; the P1 DSB waits for that TLBI to have finished on 
all threads; the 
P1-to-P2 ob ensures that has happened before the new page-table-entry write 
starts; and the DSB ensures the new write has reached memory and so is vis- 
ible to translation before subsequent instructions. The P2 ISB is needed if on 
non-ETS hardware, to force restarts of any out-of-order translations for po-later 
instructions, or (on any hardware) if P2=P1, to ensure any later translations on 
the TLBI thread are restarted, or if the new mapping affects instruction fetch. 
This generalisation seems necessary, as a TLBI might be performed by a 
virtual CPU at EL1 which is interrupted and rescheduled by an EL2 hypervisor. 
One should be able to rely on the hypervisor doing a DSB on the same hardware 
thread as part of the context switch, and that has to suffice. It is sound because 
the DSBs and TLBI are all broadcast, though note that the DSB waiting for 
TLBI completion has to be on the same hardware thread as it. 


Translation-walk non-TLB reads from other-thread writes, forbidden 
future: Above we saw that translation-walk non-TLB reads should not read 
from po-later writes. How should that be generalised to multiple threads? For 
the simplest example, consider the translation version of the LB test, in 
which two threads translation-read from each other’s po-future (iio relates 
translation reads to their accesses). 
Standard LB shapes for normal accesses without dependencies are allowed in 
Armv8-A, but this example should be forbidden: until each translation is done, 
one cannot know that the first instruction on each thread will not abort, so one 
could not make the po-later write visible to the other thread without inter-thread 
roll-back. In other words, the possibility of translation aborts creates ordering 
rather like a control dependency from translation reads to po-later writes. 

[Figure: translation version of LB. P0: Tx -iio-> Rx, followed in program order by a page-table write; P1: Ty -iio-> Ry, followed in program order by a page-table write. Each translation reads from the other thread's po-later write.] 


Multicopy atomicity of translation-walk non-TLB reads: The ARMv7 
and early Armv8-A architectures for normal accesses were non-multicopy-atomic: 
a write could become visible to some other threads before becoming visible to all 
threads, broadly similar in this respect to the IBM POWER architecture [1,53]. 
This is one of the most fundamental choices for a relaxed memory model. In 
2017 Arm revised their Armv8-A architecture to be multicopy-atomic (other 
multicopy-atomic, or OMCA, in their terminology), a considerable simplifica- 
tion [49,12]. However, there was no consideration at the time of whether this 
should also apply to the visibility of writes by translation-walk non-TLB reads, 
or of the force of the ARM statement that a translation table walk is considered 
to be a separate observer [13, D5.10.2 (p2808)]. 

For example, consider the following translation-read analogue of the classic 
WRC+addrs test, which would be forbidden in OMCA Armv8-A for normal 
reads. Suppose one has ETS, and the last-level page-table entries for x and y are 
initially invalid and not cached in any TLB. P0 writes a valid entry for x; P1 
does a translation that sees that entry and then (via an address dependency) 
writes a valid entry for y; then P2 does a translation that sees that entry and 
then (via an address dependency) tries a translation for x. Is that last 
translation guaranteed to see the valid entry instead of faulting? 

[Figure: P0: W pte(x)=valid. P1: Tx -iio-> Rx, then an address-dependent W pte(y)=valid. P2: Ty -iio-> Ry, then an address-dependent Tx.] 

A fault here might be exhibited 
by a microarchitecture with a shared TLB between P0 and P1 (e.g. if they are 
SMT threads on the same core, or have a shared TLB for a subcluster). The 
tentative Arm conclusion is that this should be forbidden, to avoid software 
issues with unexpected aborts similar to those motivating ETS. Now consider 
the above translation version of LB, generalising from po-future writes to other 
ob-future writes. For transitive combinations of reads-from and dependencies, it 
should clearly still be forbidden, to avoid needing inter-thread roll-back, but for 
ob including coherence edges (coe) one can imagine that a translate read could 
see a write before the coherence relationships are established, analogous to the 
weakness of coherence in the Power non-MCA model. 

Discussion of these and others with Arm led to the tentative conclusion for 
Armv8-A that translation-walk non-TLB reads (like normal reads) do not see any 
non-OMCA behaviour. In other words, there is no programmer-visible caching 
observable to some non-singleton subsets of threads’ translations but not others. 


3.4 Further issues 


Our discussions with Arm identified and clarified various other architectural 
choices, though for lack of space we cannot discuss them fully here, and our mod- 
els do not cover them at present. To give a flavour: (1) Misaligned or load/store- 
pair instructions give rise to multiple accesses, which might be to different pages. 
Each has its own translation, not ordered w.r.t. the others, and with no
prioritisation of faults between them. As noted in §3.3, one might translate-read
from the other, but not both simultaneously. (2) Normal registers act like a per- 
thread sequential memory, with reads reading from the most recent po-previous 
write, but the system registers that control translations can have more relaxed 
behaviour, requiring ISBs to enforce sequential behaviour. (3) The architecture 
requires, and OSs rely on, the fact that turning on the MMU does not need 
TLB maintenance. However, in a two-stage world, if Stage 1 is off, one is still 
using the TLB for Stage 2, so entries do get added to the TLB. When one later 
turns on Stage 1, it is essential that the entries added from those earlier Stage 2 
translations are not used, so one has to regard them as from a 257’th ASID. 


4 Virtual memory in the pKVM production hypervisor 


Protected KVM, or pKVM [30,27,2], is currently being developed by Google to 
provide a common hypervisor for Android, to provide improved compartmental- 
isation by a small trusted computing base (TCB) between the Linux kernel and 
other services. pKVM is built as a component of Linux. During boot, the Linux 
kernel hands over control of EL2 to the pKVM code, which constructs a memory 
map for itself and a Stage 2 memory map to encapsulate the Linux kernel. The 
Linux kernel thereafter runs only at EL1 (managing EL1&0 Stage 1 memory 
maps for itself and for user processes), as the principal guest, also known as the 
host (not to be confused with the host hardware). Other services can run as other 
guests, which are protected from the kernel and vice versa. The kernel remains 
responsible for scheduling, but context switching and inter-guest communication
are done by hypervisor calls to the pKVM code at EL2. This gives us an ideal
setting in which to examine the management of virtual memory by production 
code for Armv8-A relaxed-memory-concurrency, with both one and two stages 
of translation (for EL2 and EL1&0 respectively). The pKVM codebase is small,
so it is feasible to examine all uses of TLB management, and we benefit from dis- 
cussions with the pKVM development team. We have manually abstracted the 
main pKVM relaxed-virtual-memory scenarios into 14 tests. To give a flavour of 
these, we give one test in detail, which also illustrates the general form of virtual 
memory litmus tests; the others are described in the extended version. 

In the simplest case where pKVM is just switching from one virtual CPU 
(vCPU) to another vCPU in a different VM, pKVM restores the per-CPU reg- 
ister state and sets the VTTBR with the new VMID. So long as the two vCPUs 
are using disjoint VMIDs there is no requirement for TLB maintenance. 

This test, pKVM.vcpu_run, is below, typeset (lightly hand-edited) from the
TOML input format of our Isla tool (§6.1).


AArch64 pKVM.vcpu_run

Page table setup:                | Initial state:
option default_tables = false;   | PSTATE.EL=0b10 // initial exception level is EL2
virtual x;                       | VBAR_EL2=0x1000 // exception vector base address
physical pa1 pa2;                | ELR_EL2=L0: // exception link register, to return to from EL2
intermediate ipa1 ipa2;          | SPSR_EL2=0b00101 // saved program status
s1table hyp_map 0x200000 {       | TTBR0_EL1=ttbr(asid=0x00,base=vm1_stage1) // EL1 Stage 1
  identity 0x1000 with code;     | VTTBR_EL2=ttbr(vmid=0x0001,base=vm1_stage2) // Stage 2
  x |-> invalid; }               | TTBR0_EL2=ttbr(base=hyp_map,asid=0x00) // EL2
s1table vm1_stage1 0x2C0000 {    | x0=ttbr(asid=0x00, base=vm2_stage1)
  x |-> ipa1; }                  | x1=ttbr(base=vm2_stage2, vmid=0x0002)
s1table vm2_stage1 0x300000 {    | x3=x
  x |-> ipa2; }                  | Thread 0 (with pKVM source lines)
s2table vm1_stage2 0x240000 {    | msr ttbr0_el1, x0 // kvm/hyp/sysreg-sr.h:96
  ipa1 |-> pa1;                  | msr vttbr_el2, x1 // include/asm/kvm_mmu.h:276
  s1table vm1_stage1; }          | eret
s2table vm2_stage2 0x280000 {    | L0: ldr x2, [x3] // in guest
  ipa1 |-> invalid;              | Thread 0 EL2 handler
  ipa2 |-> pa2;                  |
  s1table vm2_stage1; }          | 0x1400:
*pa2 = 1;                        | mov x2, #0

Final state: 0:x2=0


Here there is a single physical CPU, initially running a virtual machine VM1,
with VMID 0x0001, at EL1. The section on the left defines the initial and all
potential states of the page tables, and any other memory state. This test sets
up separate translation tables for pKVM at EL2 (which has just a single stage)
and for two VMs (each with two stages, Stage 2 controlled by pKVM and Stage 1
controlled by the VM). pKVM’s own mapping hyp_map maps its code. VM1’s own
Stage 1 mapping vm1_stage1 maps virtual address x to ipa1, and the initial
pKVM-managed Stage 2 mapping vm1_stage2 maps that ipa1 to pa1, which implicitly
initially holds 0. These page tables are described concisely by a small
declarative language we developed, determining the page-table memory (here
~30k) required for the Armv8-A page-table walks.

The top-right block gives the initial Thread 0 register values, including the 
various page-table base registers. The bottom-right blocks give the code of the 
test. This starts running at EL2, as one can see from the PSTATE.EL register 
value. The key assembly lines are annotated with the pKVM source line numbers
they correspond to. To switch to run another virtual machine VM2, with
VMID 0x0002, on this same physical CPU, pKVM changes VTTBR_EL2 to the
new vm2_stage2 mapping and, as part of the context-switch register-file changes,
restores TTBR0_EL1 to VM2’s own Stage 1 mapping vm2_stage1. The code
then executes an ERET (“exception-return”) instruction to return to EL1, and
then tries to read x. The test includes a final assertion of the relaxed outcome
that register x2=0, which could occur if the ldr translation used the old VM1
mapping instead of VM2’s mapping. In this case that should not be allowed.
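
The role of the VMID in this test can be illustrated with a toy TLB whose
entries are tagged by VMID. The Python sketch below is an illustration only and
is not part of our model or of pKVM: the class ToyTLB and its translate method
are hypothetical names, and the sketch collapses both stages into a single
cached VA-to-PA entry.

```python
class ToyTLB:
    """Toy TLB caching whole VA->PA translations, tagged by VMID (hypothetical)."""

    def __init__(self):
        self.entries = {}  # (vmid, va) -> pa

    def translate(self, vmid, va, walk):
        """Return the PA for va, or None for a translation fault.

        `walk` stands in for the page-table walk of the running VM: a dict
        from VA to PA, where a missing key means an invalid entry.
        """
        key = (vmid, va)
        if key in self.entries:
            return self.entries[key]   # TLB hit: may be arbitrarily old
        pa = walk.get(va)
        if pa is not None:             # invalid entries are never cached
            self.entries[key] = pa
        return pa


memory = {"pa1": 0, "pa2": 1}          # *pa2 = 1, pa1 implicitly 0
vm1_tables = {"x": "pa1"}              # vm1_stage1 composed with vm1_stage2
vm2_tables = {"x": "pa2"}              # vm2_stage1 composed with vm2_stage2

tlb = ToyTLB()
tlb.translate(0x0001, "x", vm1_tables)  # VM1 runs first and fills the TLB

# Switching to VM2 under a disjoint VMID cannot hit VM1's entry.
x2_disjoint = memory[tlb.translate(0x0002, "x", vm2_tables)]

# Reusing VM1's VMID without TLB maintenance hits the stale entry.
x2_reused = memory[tlb.translate(0x0001, "x", vm2_tables)]
```

In this toy, the disjoint-VMID switch reads 1 from pa2, matching the intent that
x2=0 is forbidden, while reusing the VMID without invalidation returns the
stale 0 from pa1.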

Other tests capture more elaborate scenarios. For example, currently the host 
kernel manages VMIDs and assigns each VM its own VMID. If the host runs out 
of VMIDs to allocate to new vCPUs, it currently revokes all previously allocated 
VMIDs and re-allocates from the beginning, during which pKVM has to ensure
that any old vCPUs’ translations using that VMID are expelled from any TLBs
(pKVM.vcpu_run.update_vmid). If there is a concurrently executing vCPU using
that VMID, that vCPU must be paused until after the new VMID generation
(and hence any required TLB maintenance), before continuing with the freshly
allocated VMID (pKVM.vcpu_run.update_vmid.concurrent).

For another example, for pKVM to maintain the illusion that each vCPU is 
on its own core, the per-core state must be cleaned between running different 
vCPUs, including ensuring that translations for one vCPU are not cached and 
visible to another, even if they happen to be in the same VM (and using the 
same VMID) (pKVM.vcpu_run.same_vm).


5 Model 


We now define a semantic model for Armv8-A relaxed virtual memory that, to 
the best of our knowledge, captures the Arm architectural intent for the scope 
laid out in §1 and discussed in §3, including Stage 1 and Stage 2 translation-table 
walks and the required TLB maintenance. For some important questions, most 
notably for multi-copy atomicity, the Arm intent is currently tentative, so it is 
not possible to be more definitive. To capture just the synchronization required 
for “simple” software such as pKVM to work correctly we also give a weaker 
model: instead of trying to exactly capture the architecture or the behaviour of 
hardware, it has individual axioms for each behaviour that such software needs 
to rely on. This gives an over-approximation to the architecture, which we prove 
sound with respect to the model given in this section. The two models together 
delimit the design space. 

In §3 and §4 we described the design issues in microarchitectural terms,
discussing the behaviour of TLB caching and translation-walk non-TLB reads, 
along with the needs of system software. We now abstract from microarchitec- 
ture: instead of explicitly modelling TLBs, we simply include a translation-read 
event for each read performed by architected translation-table walks, and de- 
fine which writes each such translation-read can read from. We give the model 
in an axiomatic Herd-like [9] style, as an extension to the base Armv8-A
semantics [26,49,13]. In principle it would be desirable to also have equivalent
abstract-microarchitectural operational models, as for base Armv8-A [49,48] but 
with explicit TLBs for each thread and events for reading from and into the 
TLB. However, address translation introduces many more events to litmus-test 
executions, which would make them harder to explore exhaustively, and a proof 
of equivalence would be a major undertaking, so we leave this to future work. 


The base Armv8-A axiomatic model is defined as a predicate over candidate 
executions, each of which is a graph with various events (reads, writes, barriers) 
and relations over them, notably the per-thread program order po, the location 
coherence order co, the reads-from relation rf from writes to reads, the address, 
data, and control-dependency (addr, data, ctrl) subsets of po, and others. The 
base model is essentially the conjunction of an external (inter-thread) acyclicity 
property, effectively stating that the execution must respect some total order of 
events hitting the shared memory, constrained by the derived ordered-before (ob) 
relation; an internal acyclicity property, enforcing per-location coherence; and 
an atomic axiom for atomic and exclusive operations. As usual in Herd-style mod- 
els, relations are suffixed e or i to restrict to their inter-thread or intra-thread 
parts. The Herd concrete syntax for relational algebra uses [X] for the identity on
a set X, ; for composition, ~ for complement, | and & for union and intersection,
and * for product. We add translation data to events, including virtual,
intermediate physical, and physical addresses (as determined by the translation regime).
We add events for translation reads (T), TLB maintenance (TLBI), taking and 
returning from an exception (TE and ERET), and writing system registers (e.g. MSR 
TTBR). We modify the loc and co relations to relate events with the same physi- 
cal address, and add a translation-reads-from trf to relate W to the T that read 
from it. To identify events with the same address we add same-va and same-ipa 
relations, relating events to the same virtual or intermediate physical address, 
and same-{va, ipa}-page for events in the same page. To identify events with the 
same address space or virtual machine ID, we use same-vmid and same-asid. The 
translate-read events within an instruction are related in the order they appear 
in the sequential ASL/Sail execution, both to each other and to any memory 
access or fault event, with the iio (“intra-instruction order”) relation. We de- 
rive the addr relation from a new primitive tdata relation which relates read 
events to events that use that read value in the translation or computation of an 
address. For convenience we define new event sets: C for all cache-maintenance 
operations (DC, IC, and TLBI instructions); T_f for all translation-read events 
which read a descriptor which causes a fault; W_inv for all the write events which 
write an invalid descriptor; Stage1 and Stage2 for the T events which originate
from the respective stage of translation; ContextChange for all context-changing 
events (such as writes to translation-controlling system registers); and CSE for all 
context-synchronizing events (taking and returning from exceptions and ISB). 
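
For readers unfamiliar with the Herd relational operators, the following Python
sketch gives one possible reading of them, with relations represented as sets
of pairs. It is an illustration only: the function names and the three-event
example are our own choices here and are not part of Herd or Isla.

```python
def identity(xs):
    """[X]: the identity relation on the set X."""
    return {(x, x) for x in xs}


def compose(r, s):
    """r ; s: relational composition."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r):
    """r^-1: the converse relation."""
    return {(b, a) for (a, b) in r}


def product(xs, ys):
    """X * Y: the Cartesian product of two event sets."""
    return {(x, y) for x in xs for y in ys}


def complement(r, events):
    """~r: all pairs of events not in r."""
    return product(events, events) - r


def transitive_closure(r):
    """r^+: the least transitive relation containing r."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


# A toy execution: a page-table write w, a translation read t that reads
# from it, and the explicit read r of the same instruction as t.
events = {"w", "t", "r"}
trf = {("w", "t")}   # translation-reads-from
iio = {("t", "r")}   # intra-instruction order

t_then_access = compose(identity({"t"}), iio)   # [T] ; iio
ordered = transitive_closure(trf | iio)         # (trf | iio)^+
```

Union and intersection are the built-in set operators | and &, which is why
they need no helper here.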


The model is in Fig. 1, in full except for the tlb-affects relation. Its basic 
form is very similar to previous multicopy-atomic Armv8-A models. It still has 
external, internal, and atomic axioms, to which we add a translation-internal 
axiom for ensuring translations do not read from po-later writes. 
  let tlb-affects =
    (* see extended version *)

  let TLB_barrier =
    ([TLBI] ; tlb-affects ; [T] ; tfr ; [W])^-1
    & wco

  let maybe_TLB_cached =
    ([T] ; trf^-1 ; wco ; [TLBI-S1]) & tlb-affects^-1

  let tcache1 = [T & Stage1] ; tfr ; TLB_barrier
  let tcache2 = [T & Stage2] ; tfr ; TLB_barrier

  let speculative =
      ctrl
    | addr; po
    | [T] ; instruction-order

  (* translation-ordered-before *)
  let tob =
      [T_f] ; tfre
    | ([T_f] ; tfri)
      & (po ; [DSB.SY] ; instruction-order)^-1
    | [T] ; iio ; [R|W] ; po; [W]
    | speculative ; trfi

  (* observed by *)
  let obs = rfe | fr | wco
    | trfe

  (* ordered-before TLBI and translate *)
  let obtlbi_translate =
      tcache1
    | tcache2
      & (iio^-1 ; [T & Stage1] ; trf^-1 ; wco^-1)
    | (tcache2 ; wco? ; [TLBI-S1])
      & (iio^-1 ; [T & Stage1] ; maybe_TLB_cached)

  (* ordered-before TLBI *)
  let obtlbi =
      obtlbi_translate
    | [R|W|Fault] ; iio^-1 ; (obtlbi_translate & ext) ; [TLBI]

  (* context-change ordered-before *)
  let ctxob =
      speculative ; [MSR]
    | [CSE] ; instruction-order
    | [ContextChange] ; po ; [CSE]
    | speculative ; [CSE]
    | po ; [ERET] ; instruction-order ; [T]

  (* ordered-before a translation fault *)
  let obfault =
      data ; [Fault & IsFromW]
    | speculative ; [Fault & IsFromW]
    | [dmbst] ; po ; [Fault & IsFromW]
    | [dmbld] ; po ; [Fault & (IsFromW | IsFromR)]
    | [A|Q] ; po ; [Fault & (IsFromW | IsFromR)]
    | [R|W] ; po ; [Fault & IsFromW & IsReleaseW]

  (* ETS-ordered-before *)
  let obETS =
      (obfault ; [Fault]) ; iio^-1 ; [T_f]
    | ([TLBI] ; po ; [dsb] ; instruction-order ; [T]) & tlb-affects

  (* dependency-ordered-before *)
  let dob =
      addr | data
    | speculative ; [W]
    | addr; po; [W]
    | (addr | data); rfi
    | (addr | data); trfi

  (* atomic-ordered-before *)
  let aob = rmw
    | [range(rmw)]; rfi; [A | Q]

  (* barrier-ordered-before *)
  let bob = [R] ; po ; [dmbld]
    | [W] ; po ; [dmbst]
    | [dmbst]; po; [W]
    | [dmbld]; po; [R|W]
    | [L]; po; [A]
    | [A | Q]; po; [R | W]
    | [R | W]; po; [L]
    | [F | C]; po; [dsbsy]
    | [dsb] ; po

  (* Ordered-before *)
  let ob = (obs | dob | aob | bob
    | iio | tob | obtlbi | ctxob | obfault | obETS)^+

  (* Internal visibility requirement *)
  acyclic po-loc | fr | co | rf as internal
  (* External visibility requirement *)
  irreflexive ob as external
  (* Atomic requirement *)
  empty rmw & (fre; coe) as atomic
  (* Writes cannot forward to po-future translates *)
  acyclic (po-pa | trfi) as translation-internal


Fig. 1: Strong Model (with baseline Armv8-A model parts in gray)


Most of the changes to the model are in the external axiom, where we add 
several relations to ordered-before (ob): iio relates the intra-instruction events 
ordered by the ASL; tob (“translation ordered-before”) ensures the order arising 
from the act of translation itself is respected; obtlbi orders translates and their 
explicit memory events with TLBIs which affect these translations; and ctxob 
(“context ordered-before”) orders events which must come before some context- 
changing operation or after some context-synchronizing operation. We also add 
a generalised coherence-order relation, wco, an existentially quantified total order 
expressing when TLBIs complete w.r.t. writes. 
Coherence: By making loc (and therefore rf and co) relate events with the
same physical addresses, we get coherence over physical addresses rather than 
virtual. Coherence of writes to translation tables is expressed in two places: in- 
cluding trfe in obs captures the fact that translation-table reads from memory 
microarchitecturally come from the ‘flat’ coherent storage subsystem, and so 
the writes that they read from must have been propagated before the transla- 
tion happened; and the translation-internal axiom forbids forwarding against 
program-order. 
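
The effect of the translation-internal axiom can be seen on a toy execution. The
following Python sketch is illustrative only: the event names and the helper
is_acyclic are hypothetical, and the check is a plain depth-first cycle search
rather than anything taken from our tooling. It shows a translation read T that
is allowed to read from a po-earlier write W0 to its descriptor, but not from a
po-later write W.

```python
def is_acyclic(edges):
    """Return True if the directed graph given as (src, dst) pairs has no cycle."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(n):
        colour[n] = GREY
        for m in succ.get(n, ()):
            c = colour.get(m, WHITE)
            if c == GREY:                 # back edge: a cycle exists
                return False
            if c == WHITE and not visit(m):
                return False
        colour[n] = BLACK
        return True

    return all(visit(n) for n in list(succ) if colour.get(n, WHITE) == WHITE)


# One thread: W0 (write descriptor), then T (translation read of that
# descriptor), then W (another write to it). All access the same physical
# address, so po-pa relates them in program order.
po_pa = {("W0", "T"), ("T", "W"), ("W0", "W")}

trfi_from_earlier = {("W0", "T")}   # T reads from the po-earlier write
trfi_from_later = {("W", "T")}      # T reads from the po-later write

allowed = is_acyclic(po_pa | trfi_from_earlier)
forbidden = not is_acyclic(po_pa | trfi_from_later)
```

Both flags are True: the first candidate satisfies acyclic (po-pa | trfi), and
the second is rejected because T -> W -> T is a cycle.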


TLB maintenance and break-before-make: The obtlbi relation ensures 
that instructions whose translations read from writes which are “hidden” by 
some TLBI instruction are ordered before the completion of that TLBI. This is 
achieved by the two clauses of obtlbi: the first clause ensures the translation- 
before-TLBI ordering is preserved, and the second clause orders the explicit 
memory access of any such instruction with the same TLBI as the first clause. To 
do this, the model computes the set of writes which are in effect “barriered” by 
a given TLBI instruction. This is done with the tcache relations, which decide
which TLBIs affect which translations by looking at the addresses each uses and
the wco ordering between the TLBIs and related writes.

To accurately match up each of the various TLBI instructions with the transla- 
tions they may affect, we define a tlb-affects relation which relates TLBI events 
with the T events they are relevant to. We elide the full definition here, as it is 
simply the product of the enumeration of TLBI variants with the set of trans- 
lations that match the exception level, stage, address, ASID or VMID given in 
the TLBI instruction. obtlbi_translate then uses tlb-affects and wco to order 
any translations that read-from ‘stale’ writes from before the invalidation with 
the TLBI that invalidated those writes. One notable subtlety here is in Stage 2 
translations: since the TLB could store whole VA to PA mappings we must check 
that the correct Stage 1 invalidations have been performed, in addition to the 
Stage 2 ones, to be able to order the Stage 2 translation with the TLBI. 
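
To give a feel for the shape of tlb-affects, here is a deliberately simplified
Python sketch. It is not the definition from the extended version: the record
types Translate and Tlbi, their fields, and the three example invalidations are
all hypothetical, exception levels are omitted, and the Stage 2 subtlety
described above is not modelled. An unset field on a Tlbi is treated as a
wildcard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Translate:
    """A translation-read event T (hypothetical, simplified fields)."""
    name: str
    stage: int
    page: int              # VA page for Stage 1, IPA page for Stage 2
    asid: Optional[int]    # None for Stage 2 translations
    vmid: int


@dataclass(frozen=True)
class Tlbi:
    """A TLBI event; page/asid left as None match any value."""
    name: str
    stage: int
    vmid: int
    page: Optional[int] = None
    asid: Optional[int] = None


def affects(tlbi, t):
    """Does this invalidation apply to this translation read?"""
    if tlbi.stage != t.stage or tlbi.vmid != t.vmid:
        return False
    if tlbi.page is not None and tlbi.page != t.page:
        return False
    if tlbi.asid is not None and tlbi.asid != t.asid:
        return False
    return True


def tlb_affects(tlbis, translates):
    """The relation as a set of (TLBI name, T name) pairs."""
    return {(i.name, t.name) for i in tlbis for t in translates if affects(i, t)}


translates = [
    Translate("T1", stage=1, page=0x10, asid=1, vmid=1),
    Translate("T2", stage=1, page=0x20, asid=1, vmid=1),
    Translate("T3", stage=2, page=0x10, asid=None, vmid=1),
    Translate("T4", stage=1, page=0x10, asid=1, vmid=2),
]
tlbis = [
    Tlbi("by_va", stage=1, vmid=1, page=0x10, asid=1),
    Tlbi("by_asid", stage=1, vmid=1, asid=1),
    Tlbi("by_ipa", stage=2, vmid=1, page=0x10),
]
relation = tlb_affects(tlbis, translates)
```

Here the by-address invalidation relates only to T1, the by-ASID one to T1 and
T2, and the by-IPA one to T3; nothing reaches T4, which belongs to another VMID.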


Translation-table-walk reading from memory: As noted in §3.3, a transla- 
tion which results in a translation fault must read from memory or be forwarded 
from program-order earlier instructions, and those memory reads behave multi- 
copy atomically. In general the only time the model can guarantee that such a 
memory read happens is when the read results in a translation fault, since entries 
that result in a translation fault cannot be stored in the TLB (§3.2). The model 
captures this succinctly by including [T_f];tfr in ob. 

In general, a translation-read is ordered after the write which it reads from, 
as captured by the inclusion of the trfe edge in ob; this is strong enough to 
ensure that TLB fills and faulting memory walks pull values out of the memory 
system in a coherent way, but still weak enough to allow other-multi-copy-atomic 
behaviour such as forwarding. 

As mentioned in §3.3, a DSB ensures that writes are propagated out to mem- 
ory. For translations this amounts to ensuring that a faulting translation cannot 
read-from something older than a po-previous DSB-barriered write, as captured 
by the [DSB.SY] edge in tob, which says that a tfri edge from such a faulting
translation must not have an interposing DSB.

Note that the absence of the full tfr relation in ob for non-faulting trans- 
lations intentionally allows some incoherence, in essence allowing a translation- 
read to “ignore” a newer write. 


Context-changing operations: In general, the sequential semantics takes care 
of the context, such as current base register and system register state, for us. 
The ctxob relation simply ensures that such context-changing operations cannot 
be taken speculatively, and that context-synchronization ensures that all po- 
previous context-changing operations are ordered-before po-later translations. 


Detecting BBM Violations: As discussed in §3.2, we do not model in detail 
the bounded-catch-fire semantics that currently architecturally results from a 
missing break-before-make sequence, as that would make it hard to enumerate 
possible litmus-test executions. Instead, because what one normally wants to 
know for litmus tests is that a test does not exhibit a BBM failure, we conser- 
vatively detect the existence of such violations and flag them for the user. This 
is achieved through a per-candidate-execution predicate, written in SMT, which
looks for a situation which could be a break-before-make violation. It does this 
by asserting that there does not exist a pair of writes which conflict such that 
there is no interposing break-and-TLBI sequence. This approach is slightly over- 
approximate, as it might look for two writes that technically conflict even if they 
(for other reasons) are not used at the same time. This means that while we sup- 
port programs that switch from one page table to another, we do not support 
programs that garbage collect page-table memory and then repurpose it. 
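
The idea behind the check can be sketched on a much simpler representation than
the one we actually use. The Python below is an assumption-laden toy, not our
SMT predicate: it takes a single linear trace of descriptor writes and TLBIs
(standing in for wco), treats any two different valid descriptors for the same
entry as conflicting, and uses hypothetical event tuples throughout.

```python
def bbm_violations(trace):
    """Flag valid-to-valid descriptor changes with no break + TLBI between.

    trace: list of ("W", pte, descriptor) or ("TLBI", pte) events in order;
    a descriptor of None denotes an invalid entry.
    """
    live = {}        # pte -> valid descriptor that may still be TLB-cached
    broken = set()   # ptes whose most recent write made them invalid
    violations = []
    for index, event in enumerate(trace):
        if event[0] == "TLBI":
            pte = event[1]
            if pte in broken:
                live.pop(pte, None)       # break followed by TLBI: now clean
        else:
            _, pte, desc = event
            if desc is None:
                broken.add(pte)           # the "break" step
            else:
                old = live.get(pte)
                if old is not None and old != desc:
                    violations.append((pte, old, desc, index))
                live[pte] = desc          # the "make" step
                broken.discard(pte)
    return violations


good = [("W", "pte", "A"), ("W", "pte", None), ("TLBI", "pte"), ("W", "pte", "B")]
no_tlbi = [("W", "pte", "A"), ("W", "pte", None), ("W", "pte", "B")]
no_break = [("W", "pte", "A"), ("W", "pte", "B")]

good_result = bbm_violations(good)
no_tlbi_result = bbm_violations(no_tlbi)
no_break_result = bbm_violations(no_break)
```

Like the real predicate, this toy is conservative: it reports a conflict
whenever the old descriptor might still be cached, without asking whether any
translation could actually observe both.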


ETS: We discussed the Armv8-A optional ETS feature, providing additional 
ordering strength for translations. The intuition is that the model would have 
ghost events in the event an instruction faults, to represent the explicit read or 
write which would have happened had the instruction not faulted. The model 
would then have to compute a special variant of ob including such dependencies, 
but without the physical-address-dependent relations such as loc, rf and co. 
Then any edge in the version of ob with the ghost events would become an 
edge in the real ob but attached to the faulting translation. To capture this,
our model produces fault events which have the correct dependencies (and fault
information), and orders each fault event after the program-order-previous
events that the ghost access would have been ordered after, placing those edges
into ob. This involves
manually adding [dmb] ; po ; [fault], addr ; po ; [fault & FromW], etc. to 
ob. The obETS relation then orders translations which result in a translation 
fault after anything the fault is ordered-after. 


Metatheory: To establish that our models provide a simple and sound abstraction
we prove three theorems: that for static injectively-mapped address spaces,
erasing the translation events from any execution that is consistent in the
model with translation gives an execution that is consistent in the original
Armv8-A model without translation; that for any consistent execution in the
original Armv8-A model, there is a corresponding consistent execution in our
extended model with translations; and that our weak model is a sound
over-approximation of our full translation model, i.e., that any consistent
execution in our full translation model is also consistent in the weak
translation model.


6 Tooling 


6.1 Isla-based model evaluation 


Making relaxed-memory semantics exhaustively executable is essential for ex- 
ploring their behaviour on examples [66,54,53,20,9,36,65,23,63,49,56]. Handling 
relaxed virtual memory brings several new challenges. First, even just the se- 
quential definition of Armv8-A address translation, with the page-table walk and 
its options, is remarkably intricate, defined in thousands of lines of Arm’s ASL 
instruction description language. Manually reimplementing a simplified version 
would be error-prone and incomplete, so we instead build on our Isla tool [15], 
which integrates the full 123,000 line Armv8-A ISA semantics (as defined by Arm 
in ASL and automatically translated into Sail [14]), with SMT-based tooling to 
evaluate tests w.r.t. axiomatic concurrency models. Previously Isla supported 
only “user” models, expressed in a language based on relational-algebra similar 
to the Cat language of Herd [9]. 

Second, previous litmus tests typically involved only a few abstract memory locations
and events, but even simple virtual memory tests require 30kB of page tables, 
each “user” memory access might have 24 or more page-table accesses, and each 
64-bit descriptor may be represented by a symbolic value representing all possible 
states that descriptor can be in. To avoid overwhelming the SMT solver during 
symbolic execution, the formula representing each symbolic descriptor is created 
dynamically when read. When encoding the final SMT problem that decides 
whether a candidate execution is allowed, we ensure that only the parts of the 
page tables actually used by that candidate execution are included. We also 
implemented a model-specific optimization that removes irrelevant translation 
events which cannot affect the result of the test, improving performance by a 
factor of 13 on average, and up to 90 times for some tests. Third, we had to 
provide a convenient way to express the page table configuration for each test, 
with the declarative language of which we saw a small part on the left-hand side 
of the §4 test. 
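
One way to arrive at the figure of 24 page-table accesses is the usual count
for a nested walk, under the assumption (not stated above) of four levels of
table at each stage. Each Stage 1 descriptor address is an intermediate
physical address that itself needs a Stage 2 walk before the descriptor can be
read, and the final output address needs one more Stage 2 walk. The helper name
below is hypothetical and the sketch ignores TLB hits and any extra accesses.

```python
def nested_walk_reads(stage1_levels, stage2_levels):
    """Count descriptor reads for one access with nested translation.

    Each Stage 1 level costs a full Stage 2 walk of its table address plus
    the Stage 1 descriptor read itself; the final IPA costs one more
    Stage 2 walk.
    """
    per_stage1_level = stage2_levels + 1
    final_output = stage2_levels
    return stage1_levels * per_stage1_level + final_output


two_stage = nested_walk_reads(4, 4)   # 4 * (4 + 1) + 4
one_stage = nested_walk_reads(4, 0)   # no Stage 2: just the four levels
```

Under these assumptions two_stage is 24 and one_stage is 4, which is equivalent
to the closed form (n + 1)(m + 1) - 1 for n Stage 1 and m Stage 2 levels.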
A good user interface is essential. Above, we show an Isla-generated execution
for a WRC test like that of §3.3, showing how uninteresting translation events 
can be suppressed in the output to avoid overwhelming noise. 
The main result is that, in the strong model, all 214 litmus tests and 14 pKVM
tests are allowed or forbidden as intended, based on our discussion with Arm of 
their architectural intent, except two pKVM tests which time out. Additionally, 
we tested that the weak model never forbids any test allowed by the strong 
model. The tool performance is eminently usable in practice: most tests take 
around 1 minute, and the full set of litmus tests can be run in less than 2 hours 
CPU time, on a 36-core Intel Xeon Gold 6240. 

We also ran our model on an existing suite of “user” litmus tests, including 
1927 additional generated tests, with a constant identity-mapped pagetable and 
checked the results match RMEM [31] and the official Armv8-A model [26,49,13]. 


6.2 Experimental testing of hardware 


Validation of the models through experimental testing has been a vital part of 
past relaxed memory semantics [24,54,3,8]. This is equally true here. However,
experimental testing of the concurrent aspects of virtual memory is a far harder
problem: these tests need to be able to access privileged parts of the instruction
set; they need to be able to set up and use their own exception handlers,
preventing building these tools on top of standard distributions like Linux; and
Stage 2 tests and bare-metal Stage 1 tests require direct access to hardware,
preventing the use of hypervisors such as KVM around the harness. To achieve
this we build a harness that can run bare-metal on Armv8 devices to run Stage 1
(but as yet, not Stage 2) concurrent virtual memory litmus tests, which can be
found at https://github.com/rems-project/system-litmus-harness. At present this
and Isla use different test formats, so we have some tests manually written in both.

We ran tests on three devices with standard Arm cores (A53, A72). The data 
we collected suggests that in practice, aside from known errata, these cores: re- 
spect coherence over physical locations; correctly implement TLB maintenance; 
are multi-copy atomic w.r.t. translation-table walks; and generally do not
disagree with our model, except in one instance where we observed an anomalous
result which is under discussion with Arm. 

Further testing on other platforms would be desirable, but our emphasis in 
this work is principally on exploring the design space and capturing the archi- 
tectural intent, and the main validation is from discussion with the Arm Chief 
Architect, who ultimately is responsible for determining what the architecture 
is. In this context, experimental data serves mainly to provide reassurance that 
some envisaged architecture strength is not invalidated by extant hardware im- 
plementations. 


7 Related work 


There is extensive previous work on “user” relaxed-memory semantics of 
modern architectures, but very little extending this to cover systems as- 
pects such as virtual memory. We build on the approaches established in 
“user” models for x86, IBM Power, Arm, and RISC-V, combining executable-
as-test-oracle models, discussion with architects, and experimental testing
[54,5,7,47,55,53,21,52,46,9,36,31,32,49,64].

Arm publish a machine-readable version of their Armv8-A relaxed memory 
model [45], in the Cat language of the Herd7 tool [6], but that model does 
not currently cover the relaxed virtual-memory semantics. Independent work 
in progress by Alglave et al. is similarly aiming to characterise this, and to 
update Arm’s published model in due course, but with complementary scope 
to the current paper: including hardware updates of access and dirty bits, but 
without integration with the full ASL/Sail instruction semantics and its multiple 
levels and stages of translation. Both have been informed by discussion with 
senior Arm staff, and one would hope to synthesise the understanding in future. 
Hossain et al. [39] develop an “estimated” model for virtual memory in x86 
(which has a much less relaxed base semantics) in a broadly similar axiomatic 
style. Tao et al. [61] axiomatise six conditions for weak data-race-freedom that 
should be satisfied by Armv8-A kernel code that uses virtual memory in simple 
ways, and an extension of Promising-Arm [50] that effectively builds in these 
conditions; they extend the sequential verification of the SeKVM hypervisor by 
Li et al. [43] to show it satisfies these conditions. The paper does not attempt 
to characterise the exact guarantees provided by the Armv8-A architecture, or 
discuss the issues of our §3. A foundational model such as our §5 would let one 
ground such results on the actual architecture. Simner et al. [56] study relaxed 
instruction-fetch semantics. 

Several works give non-relaxed-memory semantics for Arm or x86 address 
translation, more or less simplified and with or without TLBs: Bauereiss [14], 
Goel et al. [34,35], Syeda and Klein [57,59,58,60], Degenbaev [29] (used for veri- 
fication of a hypervisor shadow pagetable implementation [42,28,11,10]), Barthe 
et al. [19,17,18,16], Tews et al. [62], Kolanski [41], and Guanciale et al. [38]. 
Abstract. Memory safety bugs continue to be a major source of security
vulnerabilities in our critical infrastructure. The CHERI project has
proposed extending conventional architectures with hardware-supported 
capabilities to enable fine-grained memory protection and scalable com- 
partmentalisation, allowing historically memory-unsafe C and C++ to 
be adapted to deterministically mitigate large classes of vulnerabilities, 
while requiring only minor changes to existing system software sources. 
Arm is currently designing and building Morello, a CHERI-enabled pro- 
totype architecture, processor, SoC, and board, extending the high-per- 
formance Neoverse N1, to enable industrial evaluation of CHERI and 
pave the way for potential mass-market adoption. However, for such a 
major new security-oriented architecture feature, it is important to es- 
tablish high confidence that it does provide the intended protections, and 
that cannot be done with conventional engineering techniques. 

In this paper we put the Morello architecture on a solid mathemat- 
ical footing from the outset. We define the fundamental security prop- 
erty that Morello aims to provide, reachable capability monotonicity, and 
prove that the architecture definition satisfies it. This proof is mechanised 
in Isabelle/HOL, and applies to a translation of the official Arm spec- 
ification of the Morello instruction-set architecture (ISA) into Isabelle. 
The main challenge is handling the complexity and scale of a production 
architecture: 62,000 lines of specification, translated to 210,000 lines of 
Isabelle. We do so by factoring the proof via a narrow abstraction cap- 
turing essential properties of arbitrary CHERI ISAs, expressed above 
a monadic intra-instruction semantics. We also develop a model-based 
test generator, which generates instruction-sequence tests that give good 
specification coverage, used in early testing of the Morello implementa- 
tion and in Morello QEMU development, and we use Arm’s internal test 
suite to validate our model. 

This gives us machine-checked mathematical proofs of whole-ISA se- 
curity properties of a full-scale industry architecture, at design-time. To 
the best of our knowledge, this is the first demonstration that that is 
feasible, and it significantly increases confidence in Morello. 
1 Introduction


Memory safety bugs continue to be a major source of security vulnerabilities, re- 
sponsible for around 70% of those addressed by Microsoft security updates, and 
around 70% of the high-severity bugs impacting Chromium [30,14]. Their root 
causes are well-known legacy design choices and limitations of normal practice: 
pervasive uses of systems programming languages that do not enforce memory 
protection; hardware that enforces only coarse-grain protection, using virtual 
memory; and test-and-debug development methods that cannot provide high as- 
surance. These are baked in to the critical systems codebase across the industry, 
and the result, in today’s adversarial environment, is that programming errors 
can often lead to exploitable vulnerabilities. 

There are many possible approaches to improving this situation, including 
development of safer programming languages, techniques for full functional- 
correctness verification, and better bug-finding tools. Each is the subject of much 
research in programming languages and semantics, and all are worthwhile, but 
the legacy investment, the need for systems code to work close to the machine, 
and the inability of bug-finding to provide high assurance, have made it very 
hard to radically improve mass-market systems. 

Another path, less well explored, is to change the architectural interface to 
provide hardware mechanisms that enable better enforcement of memory pro- 
tection. Over the last twelve years, the CHERI project [1] has been extend- 
ing conventional hardware Instruction-Set Architectures (ISAs) with new archi- 
tectural features to enable fine-grained memory protection and highly scalable 
software compartmentalisation. The CHERI memory protection features allow 
historically memory-unsafe programming languages such as C and C++ to be 
adapted to have quite different semantics, replacing many unpredictable unde- 
fined behaviour (UB) cases with predictable fail-stop traps, to provide strong 
and efficient protection against many currently widely exploited vulnerabilities. 
Crucially, this requires only minor changes to the sources of existing systems 
software. The CHERI scalable compartmentalisation features enable the fine- 
grained decomposition of operating-system (OS) and application code, to limit 
the effects of security vulnerabilities. 

CHERI provides these via hardware support for unforgeable capabilities: in 
a CHERI ISA [54], instead of using simple 64-bit machine-word virtual-address 
pointer values to access memory, restricted only by the memory management 
unit (MMU), one can use 128+1-bit capabilities that encode a virtual address 
together with the base and bounds of the memory it can access. Encoding these 
within the capability enables a fast access-time check, faulting if there is a safety 
violation. A one-bit tag per capability-sized and aligned unit of memory, cleared 
in the hardware by any non-capability write and not directly addressable, en- 
sures capability integrity by preventing forging, and the ISA design lets code 
shrink capabilities but never grow them. This architectural mechanism, along 
with additional sealed-capability features for secure encapsulation, can be used 
by programming language implementations and systems software in many ways. 
Previous academic work on CHERI has developed CHERI-MIPS and CHERI-RISC-V
architectures, FPGA processor implementations, and system software
including adaptions of Clang/LLVM, linkers, debuggers, FreeRTOS, FreeBSD, 
and WebKit. The CHERI processor prototypes implement techniques such as 
compressed capability bounds [58], and a tag controller and cache [26] required 
to implement memory tagging on off-the-shelf DRAM. The software prototypes 
use CHERI’s architectural features to implement memory-safe CHERI C/C++ 
programming languages [55], fine-grained spatial memory safety [15], heap tem- 
poral memory safety [15], and scalable software compartmentalisation [57]. An 
analysis of vulnerabilities reported to the Microsoft Security Response Center 
(MSRC) in 2019 suggested that CHERI memory safety would have determinis- 
tically mitigated 30%-70%, depending on the usage scenario [27], and porting 
the FreeBSD kernel and userspace to CHERI required changes only to 0.18% 
and 0.04% LoC respectively. Analysis of an open-source desktop stack [53] esti- 
mated a 73.8% vulnerability mitigation rate through a combination of memory 
protection and software compartmentalisation requiring a 0.026% LoC change. 


Achieving widespread adoption of any substantial new architectural feature 
is also challenging, of course, but the issues differ from those for adoption of 
a new high-level programming language. It needs coordinated hardware and 
software change, which is hard to arrange, but on the plus side there are very 
few architecture vendors, so if a feature becomes (say) part of the mainline Arm 
architecture, and there is pull from major partners, then it will be implemented 
in all conforming Arm implementations and become ubiquitously available in 
devices. For CHERI, the academic results are encouraging, but achieving such 
adoption first needs an industry-scale evaluation of a high-performance silicon 
processor implementation and software stack above it, to demonstrate viability 
and enable that pull. This is beyond what can be done academically, but hard to 
justify as a purely commercial project. The 2019-24 UKRI Digital Security by 
Design (DSbD) challenge resolves this chicken-and-egg difficulty with a combined 
public-sector and industry (£70m+117m) programme to build and evaluate such 
a demonstration platform, and support research and development above it [52].


Arm, supported in part by DSbD, is currently designing and building Morello, 
a CHERI-enabled prototype architecture, processor, system-on-chip (SoC), and 
development board, extending the Armv8.2-A architecture and the high-perfor- 
mance Neoverse N1 processor [6,8]. The Morello processor and SoC implement 
the CHERI ISAv8 protection model, and utilise CHERI’s compressed capabil- 
ity bounds and tagged memory approaches. As of 2021-01, the architecture, 
emulators, initial development boards with Morello silicon, and initial software 
toolchains, have all been developed. This will allow evaluation of the CHERI 
mechanisms in a variety of configurations and use cases on a state-of-the-art 
hardware platform, and paves the way for the potential adoption of CHERI into 
future production architectures and devices. 


In this paper, we describe work to put the Morello architecture and its se- 
curity properties on a solid mathematical footing from the outset, and to use 
semantics to ease conventional engineering. 
Fig. 1. From Morello ASL source (blue) to auto-generated artifacts (yellow) and
verification outcomes (green)


For a new architecture that aims to provide security guarantees, it is es- 
pecially important to provide high assurance that it actually does. Otherwise, 
any security flaw in the architecture will be present in any conforming hardware 
implementation, quite likely impossible to fix or work around after deployment, 
and the resulting loss of confidence might make further adoption impossible. 

For Morello, this is challenging in two ways. First, CHERI needs to be deeply 
integrated into each base architecture it gets adapted to, most obviously by mod- 
ifying all virtual-memory-accessing instructions to check bounds and permissions 
of capabilities, and by adding instructions to explicitly manipulate capabilities, 
but also in more subtle ways relating to exceptions, virtualisation, and so on. 
Second, the architecture specification is large and complex. The base Armv8-A 
architecture is defined in an 8200-page manual [7], to which the Morello archi- 
tecture supplement adds 1200 more [8]. Fortunately, Arm have recently shifted 
to using an executable version of their ASL language for instruction-set archi- 
tecture specification [40,41]. The sequential behaviour is all defined in ASL, and 
this is what appears in instruction descriptions and auxiliary functions (e.g. for 
capability compression and address translation) in the documentation. However, 
it remains very large, 62 000 non-whitespace lines of specification (LoS), and ASL 
does not itself have a mechanised semantics. 

The main intended security property of the Morello architecture is reachable 
capability monotonicity, with the intuition that the available capabilities cannot 
be increased during normal execution (i.e., they are monotonically non-increasing).
This is a whole-system property about arbitrary machine execution, and conven- 
tional techniques cannot provide high assurance that the architecture satisfies 
it. Instead, it needs proof. We translate the Arm ASL definition via the Sail [9] 
language into Isabelle/HOL [39], extending previous work for Armv8-A, and give
a mechanised statement and proof that the property holds of the architecture. 
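
The intuition behind monotonicity can be made concrete with a toy model. The following Python sketch is not the Isabelle statement of the property: the `Cap` record, its fields, and the helper names are assumptions introduced purely for illustration. It treats a capability as derivable from a set of available capabilities if some tagged member of that set covers its bounds and permissions, and checks that a step makes no capability available that was not already derivable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cap:
    """Toy capability: half-open bounds [base, top), a permission set, a tag."""
    base: int
    top: int
    perms: frozenset
    tag: bool = True

def leq(c: Cap, parent: Cap) -> bool:
    """c carries no more authority than parent (bounds and perms only shrink)."""
    return (parent.tag
            and parent.base <= c.base and c.top <= parent.top
            and c.perms <= parent.perms)

def derivable(c: Cap, available: set) -> bool:
    """Untagged values grant nothing; tagged ones need a covering parent."""
    return (not c.tag) or any(leq(c, p) for p in available)

def step_is_monotonic(before: set, after: set) -> bool:
    """Every capability available after the step was derivable before it."""
    return all(derivable(c, before) for c in after)

root = Cap(0x1000, 0x2000, frozenset({"load", "store"}))
narrowed = Cap(0x1100, 0x1200, frozenset({"load"}))
widened = Cap(0x0800, 0x2000, frozenset({"load"}))

print(step_is_monotonic({root}, {root, narrowed}))  # True
print(step_is_monotonic({root}, {root, widened}))   # False
```

The toy ignores sealing, domain transitions, and the notion of reachability, all of which the real property has to account for.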

We deal with the challenge of scale by factoring the proof via a narrow 
abstraction: four relatively simple properties of arbitrary CHERI instruction ex- 
ecution that capture essential aspects of their behaviour. Our intra-instruction 
semantics focusses on the behaviour of instructions in isolation, interacting with 
registers and memory, rather than viewing each thread as a single state machine; 
this monadic interface lets us conveniently express these abstract-CHERI prop- 
erties of instructions in terms of their register and memory effects. We prove 
capability monotonicity for arbitrary sequences of instructions above this abstraction,
and we instantiate the abstraction for Morello and prove that its many
instructions satisfy the required properties. Manual proof effort was required for 
a number of helper functions defined in the architecture for manipulating and 
using capabilities, but the bulk of the architecture is handled by automatic proof 
tools and tactics. Previous work by Nienhuis et al. [38] proved similar results for 
the much simpler and smaller (6k LoS) CHERI-MIPS architecture with a dif- 
ferent approach, manually defining a larger set of abstract actions and proving 
that those do abstract the instruction semantics. That let one capture instruc- 
tion intentions more explicitly, but needed more ad hoc machinery, while the 
new approach we follow here handles the 10x scale-up successfully. 

Our proof was developed while the architecture and hardware design were 
still evolving, using weekly snapshots of Arm’s ASL specification, with our au- 
tomation letting us quickly adapt to changes. This let us identify a number of 
bugs that could be fixed before the architecture and hardware were finalised. 

To validate the ASL-to-Sail translation of the Morello specification, we used 
the C emulator automatically generated from the Sail model to compare it 
against Arm’s internal Architecture Compliance Kit (ACK) test suite. 

Finally, we developed a test generator, using the Isla symbolic execution tool- 
ing for Sail [10], to automatically generate interesting instruction-sequence tests, 
aiming at good specification coverage. These complemented Arm’s test suite and 
were used by Arm as part of their pre-tape-out validation, and were used as the 
main test suite for development of a Morello version of the QEMU emulator. 
This helped uncover some bugs in our own tooling as well as discrepancies be- 
tween different Morello models and emulators. We also used Isla and an earlier 
Sail-to-SMT flow for quick checking of properties of capability compression. 

To summarise, our contributions are: 

— A formal and executable semantics of the Morello ISA (§3), automatically 
translated from the Arm ASL to Sail, Isabelle, and C, and validated against 
the Arm ACK (§6). 

— An abstract characterisation of the essential properties of CHERI ISA in- 
structions, expressed over their intra-instruction semantics (§4). 

— A mechanised proof of capability monotonicity for the full sequential Morello 
ISA specification (including all instructions, system registers, capability com- 
pression, etc.), with large parts of the proof automatically generated, making 
the proof more maintainable as the architecture was developed (§5). 

— Automatic ISA test generation from the specification (§7). 

This gives us machine-checked mathematical proofs of whole-ISA security 
properties of a full-scale industry architecture, at design-time. To the best of our 
knowledge, this is the first demonstration that that is feasible, and it significantly 
increases confidence in Morello. 

The main proof took only around 24 person-months, by two people between 
2020-03 and 2021-07, following around 23 person-months of preliminary work 
to get the model into usable Sail and Isabelle forms, to develop our CHERI 
abstraction in the context of earlier CHERI architectures, and on our Sail-to- 
SMT flow. Test generation and ACK validation took an additional 17 person-months,
including Morello-specific work on Isla. This suggests that such a proof
could be not just technically but also economically viable for new architecture 
design, particularly as doing this routinely, as an established flow, would reduce 
the effort substantially. 

As a side benefit, our well-validated Morello semantics is reusable for future 
software or hardware verification. The Armv8-A ISA is, along with x86, one 
of the two most important low-level programming languages, and if Morello is 
successful, then one would expect CHERI extensions to be similarly widely used. 

Sail and Isabelle versions of the Morello specification, as well as our definitions 
and proofs, are available online [3]. 


Non-goals and limitations (1) Our results establish confidence that the Morello 
instruction set architecture design satisfies its fundamental intended security 
properties. We do not address correctness of the Morello hardware implemen- 
tation of that architecture, which would be an extremely challenging hardware 
verification task, and we do not cover system components that are not specified 
by the ISA itself, e.g. the Generic Interrupt Controller (GIC). (2) The archi- 
tecture, as usual, expresses only functional correctness properties, not timing 
or power properties, to allow hardware implementation freedom. Properties and 
proofs about the architecture therefore cannot address side channels, but see [56] 
for discussion of side-channels and CHERI. (3) We consider only the sequential 
architecture. Studying concurrency effects would require a more complex system 
model integrating the Morello sequential semantics with a whole-system concur- 
rency memory model, which we leave to future work, but we expect the capability 
properties to be largely orthogonal to concurrency issues, as long as the write 
of a capability body and tag appears atomic. (4) We assume an arbitrary but
fixed translation mapping. CHERI capabilities are in terms of virtual addresses, 
so system software that manages translations has to be trusted or verified. We 
also assume that the privileged capability creation instructions are disabled and 
no external debugger is active, because these features can in general be used to 
circumvent the capability protections, as discussed in §5.1. (5) Our capability 
monotonicity property is the most fundamental property one would expect to 
hold of a CHERI architecture, but it is by no means the only such property. 
However, stronger properties typically involve specific software idioms, e.g. call- 
ing conventions or exception handlers, and their proofs use techniques that have 
not yet been scaled up to full architectures. We return to this in §8. (6) We 
prove monotonicity of the Morello specification formally in Isabelle; however,
our proof depends on an SMT solver as an oracle for one lemma, as discussed in 
§5. (7) Our conversion from ASL via Sail to Isabelle is not subject to verification, 
as neither ASL nor Sail has an independent formal semantics; their semantics
is effectively defined by this translation. However, it is nontrivial, and there is the 
possibility of mismatches with the Sail-generated C emulator used for validation; 
we do not attempt to verify that correspondence. (8) The ASL specification is 
subject to the limitations documented by Arm in [7, Appendix K14], e.g. with 
respect to implementation-defined behaviour. 
2 Overview of the Morello CHERI Architecture


CHERI is an architectural protection model that extends ISAs with a new data 
type, the architectural capability [54]. The Morello architecture adds CHERI 
capabilities to Armv8.2-A, the ISA implemented by the Neoverse N1 CPU on 
which the Morello hardware implementation is based [8]. 


2.1 CHERI Capabilities on Morello 


CHERI capabilities are twice the natural address size of the architecture plus an 
out-of-band tag bit, which is not independently addressable; for Morello, capa- 
bilities are 128+1 bits. The lower 64 bits are the “value”, which in most cases rep- 
resents a virtual address. The upper 64 bits encode metadata, including bounds, 
permissions, and other mechanisms. The tag provides integrity protection: it is 
preserved only by legitimate operations on capabilities, and cleared by others. 
A capability can only be used as such, e.g. for a dereference, if its tag is set. 


[Figure: Morello capability format. Upper 64 bits: perms[17:2], otype[14:0], bounds[86:56]; lower 64 bits: value[63:0].]


A sophisticated compression scheme allows a capability to include 64-bit 
lower and upper virtual-address bounds, encoded into 87 bits in total, with 56 of 
those shared with the value field (see [8, §2.5.1],[58] for details). Small regions can 
be described precisely, with an arbitrary size in bytes, while for larger regions, 
only certain bounds and sizes are expressible. The capability value must be either 
within the bounds or within a certain range above or below, allowing for common 
C idioms that transiently construct (but do not dereference) slightly out-of- 
bounds pointers; other combinations of value and bounds are not representable. 
This scheme trades off bounds precision for reduced capability size: supporting 
arbitrary bounds would require more than 128+1 bits per capability, which would 
have unacceptable performance costs. 
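
The precision trade-off can be illustrated with a much simpler floating-point-style scheme. The sketch below is not the Morello encoding (see [8, §2.5.1] for that); the mantissa width `MANTISSA_BITS` and the function `round_bounds` are assumptions chosen for illustration. It picks the smallest exponent at which the region length fits in the mantissa, then rounds the base down and the top up to multiples of 2^exponent.

```python
MANTISSA_BITS = 14  # assumed width, for illustration only

def round_bounds(base: int, top: int) -> tuple:
    """Return (base', top', exponent) with [base, top) inside [base', top')."""
    exponent = 0
    while True:
        align = 1 << exponent
        new_base = base & ~(align - 1)              # round base down
        new_top = (top + align - 1) & ~(align - 1)  # round top up
        if (new_top - new_base) >> exponent < (1 << MANTISSA_BITS):
            return new_base, new_top, exponent
        exponent += 1

# A small region is represented exactly, byte for byte.
print(round_bounds(0x1003, 0x1003 + 100))  # (4099, 4199, 0)

# A large, unaligned region needs a non-zero exponent and widened bounds.
base, top, exponent = round_bounds(0x1003, 0x1003 + (1 << 20))
print(exponent > 0, base <= 0x1003, top >= 0x1003 + (1 << 20))  # True True True
```

In this toy, as in any such scheme, fewer encoding bits means coarser alignment for large regions, which is why allocators on a compressed-bounds architecture typically pad and align large objects.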

Four of the 18 permission bits are reserved for software, while the others have 
architecturally defined meaning. The Load, Store, and Execute permissions con- 
trol whether a capability can be used for loading or storing data or fetching 
instructions. Permission bits for loading and storing capabilities, as opposed to 
data, also exist. The System permission controls access to system registers and 
operations, in addition to the access control mechanisms of the base Arm archi- 
tecture. Capabilities can also be sealed, making them immutable and unusable 
for anything but branching to them; this allows controlled transitions between 
different security domains. Sealing (or unsealing) a capability requires an au- 
thority capability with the Seal (or Unseal) permission; more on this below. 
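
As a rough illustration of how permissions and sealing interact, the following Python sketch models permissions as a set that can only shrink and sealing as an operation gated by an authority capability. The `Cap` record, the permission names, and the choice to clear the tag on an illegitimate operation are assumptions of this sketch, not the Morello definitions; in particular it omits the check that the object type lies within the authority's bounds.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Cap:
    """Toy capability; field and permission names are illustrative only."""
    perms: frozenset
    otype: Optional[int] = None   # None means unsealed
    tag: bool = True

def restrict_perms(c: Cap, keep: set) -> Cap:
    """Permissions can only be removed; modifying a sealed cap clears its tag."""
    if c.otype is not None:
        return replace(c, tag=False)
    return replace(c, perms=c.perms & frozenset(keep))

def seal(c: Cap, authority: Cap, otype: int) -> Cap:
    """Sealing needs a tagged, unsealed authority with the 'seal' permission."""
    ok = authority.tag and authority.otype is None and "seal" in authority.perms
    if not (ok and c.tag and c.otype is None):
        return replace(c, tag=False)
    return replace(c, otype=otype)

data = Cap(frozenset({"load", "store"}))
sealer = Cap(frozenset({"seal"}))

ro = restrict_perms(data, {"load", "execute"})
print(sorted(ro.perms))                       # ['load']

sealed = seal(ro, sealer, otype=7)
print(restrict_perms(sealed, {"load"}).tag)   # False
```

Asking for `execute` in `restrict_perms` has no effect because the result is an intersection with the existing permissions, which is the set-level counterpart of "shrink but never grow".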


2.2 Capabilities in Registers and Memory 


Morello extends the Armv8-A general-purpose integer register file, as well as cer- 
tain control and status registers, from 64 bits to 128+1 bits. Memory is extended 
with a tag bit for each 128-bit sized and aligned unit of DRAM. 
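
A minimal sketch of tagged memory, assuming a flat byte-addressed store: the class `TaggedMemory` and its method names are hypothetical and only illustrate the rule that a non-capability write clears the tag of every 16-byte unit it touches.

```python
CAP_BYTES = 16  # one tag per 128-bit sized and aligned unit

class TaggedMemory:
    """Toy byte-addressed memory with one out-of-band tag bit per 16 bytes."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.tags = [False] * (size // CAP_BYTES)

    def store_data(self, addr: int, payload: bytes) -> None:
        """A non-capability write clears the tag of every unit it touches."""
        self.data[addr:addr + len(payload)] = payload
        first = addr // CAP_BYTES
        last = (addr + len(payload) - 1) // CAP_BYTES
        for unit in range(first, last + 1):
            self.tags[unit] = False

    def store_cap(self, addr: int, body: bytes, tag: bool) -> None:
        """A capability write must be aligned and carries the tag with it."""
        assert addr % CAP_BYTES == 0 and len(body) == CAP_BYTES
        self.data[addr:addr + CAP_BYTES] = body
        self.tags[addr // CAP_BYTES] = tag

    def load_cap(self, addr: int) -> tuple:
        """Return the 16-byte body together with its tag."""
        assert addr % CAP_BYTES == 0
        return bytes(self.data[addr:addr + CAP_BYTES]), self.tags[addr // CAP_BYTES]

mem = TaggedMemory(64)
mem.store_cap(16, bytes(range(16)), tag=True)
print(mem.load_cap(16)[1])   # True
mem.store_data(31, b"\xff")  # overwrite one byte of the stored capability
print(mem.load_cap(16)[1])   # False
```

Because the tags live outside `data`, no sequence of byte writes can set one, which is the property that prevents forging a capability from raw bytes.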
The Program Counter (PC) is extended to become a Program-Counter Capability
(PCC), constraining instruction fetch as well as PC-relative loads (e.g.,
of global variables). A new Default Data Capability (DDC) special register con- 
trols and transforms memory accesses relative to machine-word pointer values 
by legacy (non-capability) instructions, for legacy code using integer pointers. 


2.3 Capability-aware Instructions 


Morello extends Armv8-A with new instructions and modifies existing instruc- 
tions to use and respect capabilities. For example, a Load capability (literal) 
instruction LDR <Ct>,<label> calculates an address from the PCC value and an 
immediate offset, loads a capability from memory, and writes it to capability 
register Ct [8, §4.4.76]. If the PCC capability does not have the load permission, 
or the calculated address is outside its bounds, a capability fault exception is 
raised. The tag of the PCC capability is also checked (as part of instruction 
fetching). Most other instructions authorise loads and stores via a capability in 
an explicitly identified register, or use DDC, rather than implicitly use PCC. 
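
The address calculation and checks for a PCC-relative capability load can be sketched as follows. This is a toy Python model, not the Sail semantics: the `Cap` record, the string permission names, and the `CapabilityFault` exception are assumptions, the checks follow the order used by CheckCapability in Fig. 2, and 64-bit wrap-around and address translation are ignored.

```python
from dataclasses import dataclass

CAP_BYTES = 16

class CapabilityFault(Exception):
    """Stands in for the architectural capability fault exception."""

@dataclass(frozen=True)
class Cap:
    value: int            # virtual address
    base: int
    top: int              # exclusive upper bound
    perms: frozenset
    tag: bool = True
    sealed: bool = False

def check_capability(c: Cap, addr: int, size: int, perm: str) -> int:
    """Check tag, seal, permission, then bounds; return addr if all pass."""
    if not c.tag:
        raise CapabilityFault("tag")
    if c.sealed:
        raise CapabilityFault("seal")
    if perm not in c.perms:
        raise CapabilityFault("perm")
    if not (c.base <= addr and addr + size <= c.top):
        raise CapabilityFault("bounds")
    return addr

def ldr_literal_address(pcc: Cap, offset: int) -> int:
    """Address of a PCC-relative capability load, aligned down to 16 bytes."""
    addr = (pcc.value + offset) & ~(CAP_BYTES - 1)
    return check_capability(pcc, addr, CAP_BYTES, "load")

pcc = Cap(value=0x4008, base=0x4000, top=0x5000,
          perms=frozenset({"load", "execute"}))
print(hex(ldr_literal_address(pcc, 0x100)))  # 0x4100
```

An offset that takes the aligned address outside `[base, top)` raises `CapabilityFault("bounds")`, and an untagged PCC fails earlier with `"tag"`.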

Conventional execution flow is also controlled by capabilities, with branch 
instructions to capability destinations (or implicitly w.r.t. the PCC for legacy 
instructions). Here too the capability must have its tag set and the target virtual 
address must be within the bounds, and in this case it must authorise execution. 

Then there are instructions to access and manipulate the fields of a capa- 
bility, including arithmetic on its virtual-address value field (corresponding to 
conventional pointer arithmetic), comparisons, and other operations to extract 
and manipulate its permissions and other data. 


2.4 Domain Transition 


CHERI distinguishes between sealed and unsealed capabilities. An unsealed ca- 
pability can be used directly (e.g. to load and store), but a sealed capability can 
only be used to request actions be taken by other software. This feature can be 
used in the context of protection domains or software compartments, in which 
whole subsystems are given access to a limited subset of memory. 

Domain X may have no direct authority to domain Y, but may call into 
domain Y by invoking one or more sealed capabilities originally sealed by (or 
for) Y. The invocation will install unsealed versions of the invoked capabilities 
in registers. This always includes replacing the current PCC, thus, this performs 
a jump to a specific code entry point provided by domain Y. These domain 
transitions are non-monotonic and must be treated specially in our proof. 

Variations on this sealing and invocation mechanism enable slightly different 
calling styles. When sealing capabilities, they can be labelled with an object type, 
if the authorising capability has that object type in its bounds. The “branch to 
sealed capability pair” instruction invokes a given code capability and also an ar- 
gument data capability, checking their object types match, providing object-style 
encapsulation. Three kinds of specialised sentry (sealed entry) capabilities may
be used transparently by direct branch instructions, memory-indirect branch
instructions, and memory-indirect branch-to-pair instructions, respectively. 
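
A hedged sketch of the "branch to sealed capability pair" idea: the Python below checks that both operands are tagged and sealed with the same object type, then returns unsealed versions to be installed as the new PCC and the argument register. The `Cap` record, the permission name, and the exception are hypothetical, and the real instruction performs further checks that are omitted here.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Cap:
    value: int
    perms: frozenset
    otype: Optional[int] = None  # None means unsealed
    tag: bool = True

class CapabilityFault(Exception):
    """Stands in for the architectural fault raised on a failed invocation."""

def branch_sealed_pair(code: Cap, data: Cap) -> tuple:
    """Unseal a matching (code, data) pair and return the new (PCC, argument)."""
    for c in (code, data):
        if not c.tag or c.otype is None:
            raise CapabilityFault("operand must be a tagged, sealed capability")
    if code.otype != data.otype:
        raise CapabilityFault("object types differ")
    if "execute" not in code.perms:
        raise CapabilityFault("code capability is not executable")
    return replace(code, otype=None), replace(data, otype=None)

entry = Cap(0x8000, frozenset({"execute"}), otype=42)
state = Cap(0x9000, frozenset({"load", "store"}), otype=42)
pcc, arg = branch_sealed_pair(entry, state)
print(pcc.otype, arg.otype)  # None None
```

The caller never holds the unsealed capabilities before the call, so the step gains authority in a controlled way; this is the kind of non-monotonic transition that a monotonicity proof has to treat as a special case.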


2.5 Exceptions and the Memory Management Unit 


In addition to compiler-facing instructions, system functionality such as virtual 
memory, cache management, and exception handling is also extended, e.g. adding 
new exception cause codes, and page-table permission bits for loading or storing 
capabilities. Because exception handling is able to restore reserved registers dur- 
ing exception-level transitions, it is also a form of domain transition, as reserved 
registers may contain capabilities not available to the executing code. 


2.6 Using CHERI in Software 


For context, we sketch how CHERI’s capability mechanisms are used by soft- 
ware to control and constrain execution. The CHERI team has adapted a large 
open-source software stack to CHERI, including the LLVM compiler, linkers, 
debuggers, multiple OSs, and application suites. The verification in this paper 
is motivated by this software usage, but is itself purely about the architecture. 

One of the main uses of capabilities is fine-grain memory protection. Spatial 
memory safety is achieved in CHERI C/C++ by implementing explicit point- 
ers (those visible in the language, e.g. variables with pointer type) and implied 
pointers (used by the generated code and runtime, e.g. the stack pointer, PLT en- 
tries, and Global Offset Table pointers) with capabilities instead of conventional 
machine-word integers. These are protected (from corruption or reinjection) by 
the CHERI tag mechanism and monotonicity, and hence the memory contents 
they point to are protected, by the capability permissions and bounds checks, 
so long as no other capabilities give undesired access to them. This relies on 
compiler-generated code, the kernel, run-time linker, and C runtime (e.g., heap 
allocator) narrowing capability bounds and permissions during execution as ap- 
propriate. This protects against many cases in which a C/C++ coding error 
could lead to an exploitable vulnerability. 
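
How a heap allocator narrows bounds can be sketched with a toy bump allocator. The names `Cap`, `set_bounds`, and `BumpAllocator` are assumptions of this sketch, which also ignores alignment, bounds representability, and freeing; it only shows each allocation receiving a capability restricted to its own object.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cap:
    value: int
    base: int
    top: int   # exclusive upper bound
    tag: bool = True

def set_bounds(c: Cap, length: int) -> Cap:
    """Narrow c to [c.value, c.value + length); any widening clears the tag."""
    new_base, new_top = c.value, c.value + length
    ok = c.tag and c.base <= new_base and new_top <= c.top
    return Cap(c.value, new_base, new_top, tag=ok)

class BumpAllocator:
    """Toy heap allocator handing out capabilities bounded to each object."""

    def __init__(self, heap: Cap):
        self.heap = heap
        self.next = heap.base

    def malloc(self, size: int) -> Cap:
        if self.next + size > self.heap.top:
            raise MemoryError("heap exhausted")
        cursor = replace(self.heap, value=self.next)  # same bounds, new address
        self.next += size
        return set_bounds(cursor, size)

heap = Cap(value=0x10000, base=0x10000, top=0x20000)
alloc = BumpAllocator(heap)
a, b = alloc.malloc(32), alloc.malloc(64)
print(hex(a.base), hex(a.top))  # 0x10000 0x10020
print(hex(b.base), hex(b.top))  # 0x10020 0x10060
```

Code holding only `a` cannot reach the bytes of `b`, even though both were derived from the same heap capability.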

Temporal memory safety, additionally protecting against reuse-after-reallo- 
cation errors, is not directly supported by the architecture, but there are a 
variety of techniques to implement it, especially for heap memory, using CHERI’s 
features [22]. Morello extends the page-table mechanism to allow capability flow 
to be tracked through memory, supporting revocation of old capabilities. 

The other main use of CHERI is software compartmentalisation, splitting the 
address space into different compartments running separate software. The capa- 
bility monotonicity property ensures these components are contained in their 
compartment boundaries. Domain transitions are possible via the sealed capa- 
bility mechanism, which can be used to set up various inter-compartment inter- 
faces. Often these transitions will all be to a privileged control component, but 
the architecture also supports direct transition between two mutually distrusting 
pieces of code. Various software models are supported, from implementing fast 
inter-process communication (IPC) to sandboxed libraries within processes.
1 function clause __DecodeA64 ((pc, ([bitone,bitzero,bitzero,bitzero,bitzero,bitzero,
2     bitone,bitzero,bitzero,bitzero,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_]
3     as __opcode)) if SEE < 99) = {
4   SEE = 99; let imm17 = Slice(__opcode, 5, 17); let Ct = Slice(__opcode, 0, 5);
5   decode_LDR_C_I_C(imm17, Ct) }
6
7 val decode_LDR_C_I_C : (bits(17), bits(5)) -> unit
8 function decode_LDR_C_I_C (imm17, Ct) = {
9   let 't = UInt(Ct);
10   let offset : bits(64) = SignExtend(imm17 @ 0b0000, 64);
11   execute_LDR_C_I_C(offset, t) }
12
13 val execute_LDR_C_I_C : forall ('t : Int), (0 <= 't & 't <= 31). (bits(64), int('t)) -> unit
14 function execute_LDR_C_I_C (offset, t) = {
15   CheckCapabilitiesEnabled();
16   let base : VirtualAddress = VAFromCapability(PCC);
17   let address : bits(64) = Align(VAddress(base) + offset, CAPABILITY_DBYTES);
18   VACheckAddress(base, address, CAPABILITY_DBYTES, CAP_PERM_LOAD, AccType_NORMAL);
19   data : bits(129) = MemC_read(address, AccType_NORMAL);
20   let data : bits(129) = CapSquashPostLoadCap(data, base);
21   C_set(t) = data }
22
23 val VACheckAddress : forall ('size : Int).
24   (VirtualAddress, bits(64), int('size), bits(64), AccType) -> unit
25 function VACheckAddress (base, addr64, size, requested_perms, acctype) = {
26   c : bits(129) = undefined;
27   if VAIsBits64(base) then { c = DDC_read() }
28   else { c = VAToCapability(base) };
29   __ignore_15 = CheckCapability(c, addr64, size, requested_perms, acctype) }
30
31 val CheckCapability : forall ('size : Int).
32   (bits(129), bits(64), int('size), bits(64), AccType) -> bits(64)
33 function CheckCapability (c, address, size, requested_perms, acctype) = {
34   let el : bits(2) = AArch64_AccessUsesEL(acctype);
35   let 'msbit = AddrTop(address, el);
36   let s1_enabled : bool = AArch64_IsStageOneEnabled(acctype);
37   addressforbounds : bits(64) = address; [...7 lines setting addressforbounds...]
38   fault_type : Fault = Fault_None;
39   if CapIsTagClear(c) then { fault_type = Fault_CapTag }
40   else if CapIsSealed(c) then { fault_type = Fault_CapSeal }
41   else if not_bool(CapCheckPermissions(c, requested_perms))
42     then { fault_type = Fault_CapPerm }
43   else if (requested_perms & CAP_PERM_EXECUTE) != CAP_PERM_NONE
44     & not_bool(CapIsExecutePermitted(c)) then { fault_type = Fault_CapPerm }
45   else if not_bool(CapIsRangeInBounds(c, addressforbounds, size[64 .. 0]))
46     then { fault_type = Fault_CapBounds };
47   if fault_type != Fault_None then {
48     let is_store : bool = CapPermsInclude(requested_perms, CAP_PERM_STORE);
49     let fault : FaultRecord = CapabilityFault(fault_type, acctype, is_store);
50     AArch64_Abort(address, fault) };
51   return(address) }


Fig. 2. Sample Morello instruction semantics, in Sail, for parts of the LDR (lit- 
eral) instruction [8, §4.4.76] for loading a capability from a PCC-relative address. 
Lines 1-5 are the relevant opcode pattern-match clause. That calls the decode func- 
tion on Lines 7-11, which calls the execute function on Lines 13-21. That uses 
auxiliary function VACheckAddress (Lines 23-29) to check that the PCC capability 
(wrapped in a VirtualAddress structure) has the right bounds and permissions, rais- 
ing an exception otherwise (Lines 47-50). MemC_read (Line 19) performs the load, and 
CapSquashPostLoadCap (Line 20) performs additional checks, in particular clearing the 
tag of the loaded capability if the authorising capability does not have capability load 
permission. 
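
For readers less familiar with Sail, the decode step in Fig. 2 can be paraphrased in Python. This is an informal sketch rather than generated code: the helper names are hypothetical, the 10-bit prefix is read off the opcode pattern in the figure, and the offset is returned as a signed Python integer instead of a 64-bit bitvector.

```python
LDR_C_LITERAL_PREFIX = 0b1000001000  # top 10 opcode bits, as read from Fig. 2

def bits(word: int, low: int, length: int) -> int:
    """Counterpart of Slice(word, low, length) in Fig. 2."""
    return (word >> low) & ((1 << length) - 1)

def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement integer."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def decode_ldr_c_literal(opcode: int):
    """Return (offset, t) for a matching 32-bit opcode, else None."""
    if bits(opcode, 22, 10) != LDR_C_LITERAL_PREFIX:
        return None
    imm17 = bits(opcode, 5, 17)
    t = bits(opcode, 0, 5)
    offset = sign_extend(imm17 << 4, 21)  # imm17 @ 0b0000, then SignExtend
    return offset, t

# imm17 with all bits set encodes the smallest negative offset, -16 bytes.
opcode = (LDR_C_LITERAL_PREFIX << 22) | (0x1FFFF << 5) | 3
print(decode_ldr_c_literal(opcode))  # (-16, 3)
```

Appending four zero bits before sign extension means offsets are always multiples of 16, matching the 16-byte alignment of capabilities in memory.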
3 Concrete Semantics of Morello


The basis for our verification and validation work for Morello is the ISA speci- 
fication written by Arm in their ASL language. It includes sequential semantics 
of the capability mechanisms and instructions, along with all of the Armv8-A 
AArch64 base architecture and its extensions supported by Morello, e.g. float- 
ing point and vector instructions, system registers, exceptions, user mode, sys- 
tem mode, hypervisor mode, some debugging features, and virtual memory ad- 
dress translation. In total, the Morello ASL specification is around 62000 non- 
whitespace lines, covering 409 instructions, 1050 encodings, 600 automatically 
generated accessor functions for reading and writing system registers, and 1500 
additional helper functions. Arm provided weekly snapshots of the ASL specifi- 
cation while it was being developed. 

ASL is a first-order imperative language with exceptions. Originally a pa- 
per language only, it was made executable by Reid et al. [40,41]. It supports 
bitvectors of computed sizes, but bitvector indexing is not statically checked; 
it also supports mathematical integers and some limited structured types. The 
Arm documentation provides an informal description of the language [7, Ap- 
pendix K14], but does not provide a formal semantics. We obtain a formal se- 
mantics of Morello by translating the ASL specification into Sail [9], a similar 
language but with a richer type system and open-source tooling, and thence into 
Isabelle/HOL, as 90000 and 210000 LoS respectively. Fig. 2 shows parts of the 
Sail semantics for the Morello LDR (literal) instruction for loading a capability 
from a PCC-relative address. This is just an iceberg-tip of the whole semantics, 
even just for this instruction: the MemC_read involves all of address translation, 
and the call graph of the definitions shown amounts to 7 300 lines of Sail. 

We reused the existing open-source Sail tooling and ASL-to-Sail transla- 
tion [9,10] mostly as-is, with only minor improvements and some engineering 
work needed to handle Morello. In addition to the Isabelle definitions, we gen- 
erate a C emulator for validation (§6) using the Sail tool, and we reuse the Isla 
symbolic execution engine for Sail [10] to generate tests (§7). 


4 Abstract Formal Model of Capability Monotonicity 


The main challenge in proving whole-ISA security properties of Morello is the 
scale and complexity of the model. Rather than a direct proof above the 210 000- 
line Isabelle specification, we factor the proof via an abstraction (instantiated 
for Morello in §5) that captures the essential properties of arbitrary instruction 
behaviour in any CHERI ISA. It has to spell out aspects of CHERI in some 
detail, e.g. the different kinds of non-monotonic domain transitions (cf. §2.4), but 
it abstracts away ISA details not directly relevant for capability monotonicity. 


4.1 ISA Abstraction 


The abstraction is defined as properties of an arbitrary sequential ISA semantics, 
encoded in a monadic type with a trace semantics that exposes the individual 
register and memory effects of instructions. This interface was originally designed
to connect Sail ISA semantics to relaxed memory models, but we found the 
factorisation via effects useful for reasoning even in a simple sequential setting. 

The monad essentially corresponds to a free monad over an effect datatype. 
It is parameterised with a return type 'a, an exception type 'e, and a sum type
of register value types 'regval (automatically generated by Sail for each ISA):

  type M 'regval 'a 'e =
    | Done of 'a | Fail of string | Exception of 'e
    | Read_memt of kind * addr * nat * ((bytes * tag) -> M 'regval 'a 'e)
    | Read_reg of register_name * ('regval -> M 'regval 'a 'e)
    | ...


Finished outcomes either indicate successful termination with a return value a 
(denoted as Done a), an exception (Exception e) which can be caught using a 
try_catch combinator, or a failure (Fail msg), e.g. due to a failed assertion. Ef- 
fect outcomes carry a continuation that expects a response and returns the next 
monadic outcome. Monadic return wraps a value in Done, while bind just nests 
the outcomes without interpreting the effects. We also define a corresponding 
type of events, e.g. E_read_reg (with only concrete values, not continuations), 
along with an effect trace semantics for monadic expressions. We define our
requirements on CHERI ISAs in terms of constraints on these traces in §4.4.
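
As an illustration only, the following Python sketch mimics the structure of such a monad with two finished outcomes and a single effect. The class names, the `run` interpreter and the event tuples are assumptions made for this example; they are not the Sail or Isabelle definitions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Done:                 # finished outcome: successful termination
    value: Any


@dataclass
class Fail:                 # finished outcome: failure, e.g. a failed assertion
    msg: str


@dataclass
class ReadReg:              # effect outcome: register name plus continuation
    name: str
    k: Callable[[Any], Any]


def ret(v):
    """Monadic return: wrap a value in Done."""
    return Done(v)


def bind(m, f):
    """Nest outcomes without interpreting the effects."""
    if isinstance(m, Done):
        return f(m.value)
    if isinstance(m, ReadReg):
        return ReadReg(m.name, lambda v: bind(m.k(v), f))
    return m                # Fail propagates unchanged


def run(m, regs):
    """Answer each effect from a register map and record an event trace."""
    trace = []
    while isinstance(m, ReadReg):
        v = regs[m.name]
        trace.append(("E_read_reg", m.name, v))   # events hold concrete values only
        m = m.k(v)
    return m, trace


# Read two registers and return their sum.
prog = bind(ReadReg("x0", ret), lambda a:
            bind(ReadReg("x1", ret), lambda b: ret(a + b)))
result, trace = run(prog, {"x0": 2, "x1": 40})
```

Here `prog` is a pure data structure until `run` supplies the register values, which is what allows the same instruction semantics to be paired with different interpretations of the effects.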


4.2 CHERI ISA Parameters 


In addition to the ISA semantics themselves, our properties are parameterised on 
aspects of the ISA relevant to CHERI. This includes names of special registers, 
in particular the program counter capability register PCC, the invoked data 
capability register IDC (capability register 29 on Morello, r31 on CHERI-RISC- 
V), registers holding capabilities to exception handlers (VBAR_ELn on Morello), 
and privileged registers requiring system register access permission. 

Moreover, we need to know which instructions may perform sealed capability 
invocations, as this potentially constitutes a non-monotonic security domain 
transition. We model this as functions taking an instruction identifier and an 
effect trace of a particular execution, and returning, respectively, the directly or 
indirectly invoked sealed capabilities in the trace. For example, the Morello BRS 
instruction invokes the sealed capabilities in its two input registers, and other 
branch instructions can also invoke sealed capabilities if they are sentries. 

Finally, the mapping from virtual to physical memory addresses is captured 
by a pure partial function taking a virtual address and a (partial) instruction 
execution trace, from which it can extract the required information about the ad- 
dress mapping to determine the physical address, if any. This is needed because 
capabilities are in terms of virtual addresses, but the memory effects produced 
by the ISA semantics are in terms of physical addresses, so we need a way to 
translate between those when formulating requirements on memory accesses in 
the abstract model. We also assume another function as a parameter to distin- 
guish memory operations that happen as part of an in-memory translation table 
walk, as the constraints on them differ from those on other memory operations. 
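
One way to picture these parameters is as a record of register-name sets and functions over traces. The Python sketch below is hypothetical: the field names, the register names in the toy instance and the tuple encoding of events are assumptions for illustration, not the interface of the actual model.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

Event = tuple   # hypothetical event encoding, e.g. ("read_reg", "PCC", value)
Trace = list    # a trace is a list of events


@dataclass(frozen=True)
class CheriIsaParams:
    pcc: FrozenSet[str]                   # program counter capability register(s)
    idc: FrozenSet[str]                   # invoked data capability register(s)
    exception_base_regs: FrozenSet[str]   # registers holding handler capabilities
    privileged_regs: FrozenSet[str]       # need system register access permission
    # Directly and indirectly invoked sealed capabilities of an instruction in a trace.
    invoked_caps: Callable[[str, Trace], frozenset]
    invoked_indirect_caps: Callable[[str, Trace], frozenset]
    # Pure partial function: virtual address and trace to physical address, if any.
    translate_address: Callable[[int, Trace], Optional[int]]
    # Marks memory events that belong to a translation table walk.
    is_translation_event: Callable[[Event], bool]


def toy_translate(vaddr, trace):
    """Toy mapping: one 4 KiB page at virtual 0x1000 mapped to physical 0x8000."""
    if 0x1000 <= vaddr < 0x2000:
        return vaddr - 0x1000 + 0x8000
    return None   # unmapped address


toy = CheriIsaParams(
    pcc=frozenset({"PCC"}),
    idc=frozenset({"C29"}),
    exception_base_regs=frozenset({"VBAR_EL1"}),
    privileged_regs=frozenset({"VBAR_EL1"}),
    invoked_caps=lambda instr, trace: frozenset(),
    invoked_indirect_caps=lambda instr, trace: frozenset(),
    translate_address=toy_translate,
    is_translation_event=lambda ev: ev[0] == "ttw_read",
)
```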


4.3 Capability Abstraction


We capture capabilities in the abstract model via a typeclass that provides meth- 
ods for accessing the various fields of capabilities, as well as sealing and unseal- 
ing operations. We also define a notion of derivability that serves as an upper 
bound on the capability manipulations that instructions are normally allowed 
to perform. Starting from a set of capabilities C, e.g. provided as inputs to an
instruction, the set of capabilities derivable from C is defined inductively as the
smallest set that contains C itself as well as capabilities obtained from other
derivable ones via one of the following:


— manipulating an unsealed capability c into c' such that bounds or permissions
are not increased, formalised using an ordering where c' ≤ c iff either c' = c,
or c' is untagged, or both are tagged and unsealed and the bounds and
permissions of c include those of c';

— turning a capability into a sealed entry capability; 

— sealing a capability using another derivable sealing authority capability, set- 
ting the object type of the sealed capability to the current address value of 
the authority capability (interpreted as an object type), if the authorising 
capability is tagged and unsealed, has sealing permission, and its value (and 
therefore the object type) is within its bounds; or 

— unsealing a capability using another derivable unsealing authority capability, 
if the latter is tagged and unsealed, has unsealing permission, and its value 
is within bounds and matches the object type of the sealed capability. 


Of these operations, unsealing is the only one that may grant new privileges that 
are not already granted by the input capabilities. However, unsealing requires 
specific authority. An operating system, for example, can control what capabil- 
ities a user-space process can unseal by only handing out unsealing authority 
capabilities with a limited set of object types in their bounds. 
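
The ordering and the unsealing rule can be sketched in Python as follows. This is a simplified model under assumed names (`Cap`, `leq`, `can_unseal`, the permission string "Unseal"); it omits sealing and sentry creation and is not the typeclass used in the formal development.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Cap:
    tag: bool
    base: int                     # lower bound, inclusive
    top: int                      # upper bound, exclusive
    perms: FrozenSet[str]
    value: int = 0                # current address value
    otype: Optional[int] = None   # object type; None means unsealed


def leq(c1: Cap, c: Cap) -> bool:
    """c1 <= c: equal, or c1 untagged, or both tagged and unsealed with the
    bounds and permissions of c including those of c1."""
    if c1 == c or not c1.tag:
        return True
    return (c.tag and c1.otype is None and c.otype is None
            and c.base <= c1.base and c1.top <= c.top
            and c1.perms <= c.perms)


def can_unseal(auth: Cap, sealed: Cap) -> bool:
    """The authority must be tagged, unsealed, have unsealing permission, and
    its value must be in bounds and match the object type of `sealed`."""
    return (auth.tag and auth.otype is None and "Unseal" in auth.perms
            and auth.base <= auth.value < auth.top
            and sealed.otype == auth.value)


def unseal(sealed: Cap, auth: Cap) -> Optional[Cap]:
    """Return the unsealed capability, or None if the authority is insufficient."""
    return replace(sealed, otype=None) if can_unseal(auth, sealed) else None


root = Cap(True, 0x0, 0x1000, frozenset({"Load", "Store"}))
narrow = Cap(True, 0x100, 0x200, frozenset({"Load"}))
sealed = replace(narrow, otype=5)
auth = Cap(True, 4, 8, frozenset({"Unseal"}), value=5)
```

In this model `narrow` is below `root` in the ordering, `sealed` is not (sealed capabilities cannot be manipulated), and `auth` can unseal `sealed` only because object type 5 lies within its bounds.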


4.4 CHERI ISA Intra-instruction Properties 


Our abstraction is defined as the conjunction of four instruction-local properties. 
They are relatively straightforward to verify for a concrete ISA, and we will 
describe the proof for Morello in §5. At the same time, the properties imply the 
whole-ISA property of reachable capability monotonicity, as explained in §4.5. 
Hence, they serve as a useful intermediate abstraction layer for structuring the 
overall proof. 

The central security guarantee that CHERI ISAs aim to provide is that 
software cannot forge capabilities and thereby escalate its privileges. Hence, we 
require that instructions only produce capabilities via the above derivation rules, 
except for the effects of well-defined transition mechanisms for switching control 
to another security domain. 


Property 1 (Capability register writes). In any execution trace of a single in- 
struction, for every write of a tagged capability to a register at a given point in 
the trace, one of the following holds: 

1. The capability is derivable from the capabilities that the instruction has
available at this point in the trace. 

2. The capability is an invoked capability and written to the PCC or IDC 
register as part of a sealed capability invocation. 

3. The capability has been loaded from an exception handler base register and 
is written to the PCC register as part of raising an ISA exception. 


The first case permits the normal operation of instructions, manipulating 
capabilities according to the above derivability rules. We allow instructions to 
use their available capabilities in these operations, which normally includes ca- 
pabilities read from registers or loaded from memory up to the given point in 
the trace, with some exceptions: First, capabilities read from privileged registers 
are unavailable unless the system access permission is also available, i.e. if a 
tagged and unsealed capability with that permission has been read from PCC 
before. Second, we exclude capabilities loaded as part of translation table walks, 
as those loads are not subject to capability checks (although none of the existing 
CHERI ISAs attempt to load capabilities during translation table walks). Third, 
capabilities used in a domain transition, e.g. capabilities loaded from memory 
as part of an indirect sealed capability invocation, are unavailable for normal 
operations and handled separately by the other cases of Property 1 as follows. 

The sealed capability invocation case applies when the capability being writ- 
ten is an invoked capability of the current instruction, as declared when instan- 
tiating the CHERI ISA abstraction (see §4.2). Such an invocation performs a 
branch to the unsealed code capability by writing it to the PCC register, and 
possibly writes an unsealed data capability to IDC. One of the following cases 
must hold, representing the different supported kinds of capability invocation: 


Sealed pair: A pair of capabilities sealed with the same, non-sentry object type
and with BranchSealedPair permission is available, the capability that is
being written is an unsealed version of one of those, and it is either
written to PCC and has the execute permission, or written to the invoked
data capability register IDC and does not have the execute permission.

Direct sentry: The capability is written to PCC, and a version of it that is
sealed with a sentry object type is available to the instruction.

Indirect sentry: An indirect sentry capability is available and used to load
either two capabilities from memory that may be written to the PCC and IDC
registers, or one capability that may be written to PCC while the unsealed
version of the indirect sentry itself may be written to IDC.


The ISA exception case is signalled in the Morello model by the helper func- 
tion AArch64.TakeException throwing a (Sail language) exception after setting 
up the branch to the exception handler. In this case, we allow a capability to the 
exception handler to be read from a privileged exception handler base register 
and written to PCC, even if system register access permission is not available. 
However, the definition of available capabilities together with our properties 
guarantee that this capability is not used for any other operations. 


let store_cap_reg_axiom ISA has_ex invoked_caps invoked_indirect_caps t =
  let use_mem_caps = (invoked_indirect_caps = {}) in
  (∀ i c r. (writes_to_reg_at_idx i t = Just r ∧ c ∈ (writes_reg_caps_at_idx ISA i t))
    ⟶
    (* Only store monotonically derivable capabilities to registers *)
    (cap_derivable (available_caps ISA use_mem_caps i t) c ∨
    (* ... or perform one of the following non-monotonic register writes: *)
    (* Exception *)
    (has_ex ∧ c ∈ exception_targets_at_idx ISA i t ∧ r ∈ ISA.PCC) ∨
    (* Capability pair invocation *)
    (∃ cc cd. ((c ≤ (unseal cc) ∧ r ∈ ISA.PCC) ∨ (c ≤ (unseal cd) ∧ r ∈ ISA.IDC)) ∧
       cap_derivable (available_caps ISA use_mem_caps i t) cc ∧
       cap_derivable (available_caps ISA use_mem_caps i t) cd ∧
       invokable cc cd ∧ c ∈ invoked_caps) ∨
    (* Direct sentry invocation *)
    (∃ cs. c ≤ (unseal cs) ∧ is_sentry cs ∧ is_sealed cs ∧ r ∈ ISA.PCC ∧
       cap_derivable (available_caps ISA use_mem_caps i t) cs ∧
       c ∈ invoked_caps) ∨
    (* Indirect sentry invocation (writing the unsealed sentry to IDC) *)
    (∃ cs. c ≤ (unseal cs) ∧ r ∈ ISA.IDC ∧ is_indirect_sentry cs ∧ is_sealed cs ∧
       cap_derivable (available_reg_caps ISA i t) cs ∧
       c ∈ invoked_indirect_caps) ∨
    (* Indirect capability (pair) invocation *)
    (* (writing the loaded capability/capabilities to PCC/IDC) *)
    (∃ c'. ((c ≤ (unseal c') ∧ is_sealed c' ∧ is_sentry c' ∧ r ∈ ISA.PCC) ∨
            (c ≤ c' ∧ r ∈ (ISA.PCC ∪ ISA.IDC))) ∧
       cap_derivable (available_mem_caps ISA i t) c' ∧
       c ∈ invoked_caps ∧ invoked_indirect_caps ≠ {})))


Fig. 3. Formal definition of capability register write Property 1, slightly simplified 


We formalise Property 1 as a predicate on traces, given in Fig. 3. It takes
a number of arguments that we instantiate using the CHERI ISA parameters
of §4.2, e.g. with invoked_caps set to the capabilities that the given instruction
invokes in the given trace. The predicate details the different cases (and
invocation subcases) of Property 1 for all capabilities written to registers, using
helper definitions such as available_caps or invokable (checking permissions and
object types of a pair of sealed capabilities).
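
To give a feel for the notion of available capabilities, here is a simplified Python sketch. It is an assumption-laden toy, not the Isabelle definition: events are `(kind, location, capability)` tuples, the event kinds and the register set `PRIVILEGED` are made up for the example, and the exclusion of capabilities used in domain transitions is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Cap:
    tag: bool
    sealed: bool
    perms: FrozenSet[str]


PCC = "PCC"
PRIVILEGED = frozenset({"VBAR_EL1"})   # hypothetical privileged register set


def available_caps(trace, i):
    """Capabilities available to the instruction just before event i."""
    prefix = trace[:i]
    # System access counts as available once a tagged, unsealed capability
    # with that permission has been read from PCC.
    sys_access = any(kind == "read_reg" and loc == PCC and cap.tag
                     and not cap.sealed and "System" in cap.perms
                     for kind, loc, cap in prefix)
    avail = set()
    for kind, loc, cap in prefix:
        if kind == "read_reg" and (loc not in PRIVILEGED or sys_access):
            avail.add(cap)
        elif kind == "load_cap":       # ordinary, capability-checked load
            avail.add(cap)
        # "ttw_load" (translation table walk) events contribute nothing
    return avail


user_pcc = Cap(True, False, frozenset({"Execute"}))
sys_pcc = Cap(True, False, frozenset({"Execute", "System"}))
handler = Cap(True, False, frozenset({"Execute", "Load"}))
walk = Cap(True, False, frozenset({"Load"}))

t_user = [("read_reg", PCC, user_pcc), ("read_reg", "VBAR_EL1", handler),
          ("ttw_load", 0x40, walk)]
t_sys = [("read_reg", PCC, sys_pcc), ("read_reg", "VBAR_EL1", handler),
         ("ttw_load", 0x40, walk)]
```

With the user-mode PCC only the PCC capability itself is available; with system access permission the capability read from the privileged register becomes available as well, while the translation table walk load never does.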

The other three properties state that capabilities stored to memory must be 
derivable from available capabilities (here there are no non-monotonic exception 
cases), and that accesses to memory or privileged registers must be authorised 
by capabilities with sufficient permissions and bounds. 


Property 2 (Capability stores). Every tagged capability stored to memory at a 
given point in an execution trace of a single instruction is derivable from the 
available capabilities at that point in the trace. 


Property 3 (Privileged registers). Reads from or writes to privileged registers 
in an execution trace of a single instruction happen only after a tagged and 
unsealed capability with system register access permission has been read from 
PCC, unless an ISA exception is raised in the trace and the event is a read from 
an exception handler base register. 


Property 4 (Memory accesses). For every load or store event at a given point in
an execution trace of a single instruction, there is a tagged capability available at 
that point in the trace that authorises the memory operation (further explained 
below), unless the event is part of a translation table walk. The authorising ca- 
pability must be unsealed, unless it is an indirect sentry capability being invoked 
in this trace and the event is a load. If the event is a load or a store of a tagged 
capability, then the address must be aligned to the capability size. 


The authorising capability for memory accesses must be tagged and have the 
right bounds and permissions: the latter must include load/store permission, 
and there must be a virtual address range covered by the bounds of the capabil- 
ity that translates to the physical address range covered by the memory event. 
Loading/storing capabilities (and not just untagged data) requires additional 
permission bits. The authorising capability must also normally be unsealed; the 
only allowed case of using a sealed capability for a memory operation is the 
invocation of an indirect sentry capability. In that case, Property 1 allows the 
loaded capability (or pair of capabilities) to be written to PCC (or IDC). How- 
ever, due to the definition of available capabilities, the loaded capabilities will in 
this case be unavailable for other purposes. Only capabilities loaded via unsealed 
authorising capabilities can be used for regular operations. 
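
The authorisation check for an ordinary (non-invocation) memory access can be sketched as below. This Python toy makes several assumptions: the permission names, the byte-wise `translate` function and the brute-force search for a matching virtual range are illustrative choices only, and the indirect sentry case is omitted.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Cap:
    tag: bool
    sealed: bool
    base: int             # virtual address, inclusive
    top: int              # virtual address, exclusive
    perms: FrozenSet[str]


def authorises(cap, translate, paddr, size, is_load, is_cap_access):
    """Does `cap` authorise an access to `size` bytes at physical `paddr`?"""
    need = {"Load" if is_load else "Store"}
    if is_cap_access:                 # capabilities need an extra permission bit
        need.add("LoadCap" if is_load else "StoreCap")
    if not (cap.tag and not cap.sealed and need <= cap.perms):
        return False
    # Search for a virtual range within the bounds that translates to the
    # physical range of the event (fine for a toy, far too slow in general).
    for vaddr in range(cap.base, cap.top - size + 1):
        if all(translate(vaddr + k) == paddr + k for k in range(size)):
            return True
    return False


def translate(vaddr):
    """Toy partial mapping: virtual [0, 0x100) maps to physical [0x1000, 0x1100)."""
    return vaddr + 0x1000 if vaddr < 0x100 else None


cap = Cap(True, False, 0x10, 0x20, frozenset({"Load"}))
```

For example, `cap` authorises an 8-byte data load at physical address 0x1010, but not a store, not a capability load, and not a load at 0x101c, which would extend past its upper bound.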

In addition to the instruction semantics, our ISA models also contain ASL/Sail 
code defining instruction fetch and decode behaviour. We use this for generating 
emulators, but also for stating the whole-ISA monotonicity theorem below with 
respect to multi-instruction traces produced by a fetch-decode-execute loop. For 
the fetch segments of these traces, we require the same properties to hold as 
for individual instruction execution traces, with the only difference being in the 
authorisation of memory loads: we assume that instruction fetching only loads 
instructions from memory, so we do not allow instruction fetching to perform 
capability memory loads, and we require that it checks for the execute rather 
than the load permission in the authorising capability. 


4.5 Capability Monotonicity Theorem 


The above single-instruction properties are sufficient to prove a whole-ISA mono- 
tonicity theorem for reachable capabilities. This set of reachable capabilities for a 
given state of the system is defined inductively as the smallest set that includes: 


— capabilities in non-privileged registers, and those in privileged registers if a 
tagged and unsealed capability with system access permission is reachable; 

— in-memory capabilities at capability-aligned virtual addresses, if there is a 
reachable capability that authorises loading the capability; and 

— capabilities derivable from reachable capabilities via the rules of §4.3, i.e.
restricting bounds or permissions, creating sentry capabilities, or
sealing/unsealing capabilities (if a suitable authorising capability is also reachable).


This set is intended to provide an upper bound on the set of capabilities that 
software can construct (on its own) when starting execution in the given state, 
and the monotonicity theorem confirms that it is indeed an upper bound. 
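
The first two rules of this inductive definition amount to a fixed-point computation, sketched here in Python. The sketch is a simplification under assumed names and permission strings: it tracks only capabilities found in registers and memory and does not close the set under the derivation rules of §4.3, and it uses virtual addresses directly as memory keys.

```python
from dataclasses import dataclass
from typing import FrozenSet

CAP_SIZE = 16   # assumed capability size and alignment in bytes


@dataclass(frozen=True)
class Cap:
    tag: bool
    sealed: bool
    base: int
    top: int
    perms: FrozenSet[str]


def usable(c):
    return c.tag and not c.sealed


def reachable(regs, privileged, mem):
    """Least set containing non-privileged register capabilities, privileged
    ones if system access is reachable, and loadable in-memory capabilities."""
    reach = {c for r, c in regs.items() if r not in privileged}
    changed = True
    while changed:
        changed = False
        new = set()
        if any(usable(c) and "System" in c.perms for c in reach):
            new |= {c for r, c in regs.items() if r in privileged}
        for addr, c in mem.items():
            if addr % CAP_SIZE == 0 and any(
                    usable(a) and {"Load", "LoadCap"} <= a.perms
                    and a.base <= addr and addr + CAP_SIZE <= a.top
                    for a in reach):
                new.add(c)
        if not new <= reach:
            reach |= new
            changed = True
    return reach


data = Cap(True, False, 0x100, 0x120, frozenset({"Load", "LoadCap"}))
sysc = Cap(True, False, 0, 0, frozenset({"System"}))
handler = Cap(True, False, 0x8000, 0x9000, frozenset({"Execute"}))
secret = Cap(True, False, 0x4000, 0x5000, frozenset({"Load", "Store"}))

regs = {"c0": data, "VBAR_EL1": handler}
mem = {0x100: sysc, 0x200: secret}
result = reachable(regs, {"VBAR_EL1"}, mem)
```

In this example `data` makes the in-memory capability `sysc` reachable, which in turn makes the privileged register contents reachable; `secret` stays unreachable because no reachable capability authorises loading from its address.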

We assume a sequential setting and state the theorem with respect to
executions of a sequential fetch-decode-execute loop; reasoning about concurrent
behaviour is beyond the scope of this paper. Executing an effect trace t from
a state s leading to a state s', written $s \xrightarrow{t} s'$, is possible if the register and
memory contents in read events along the trace t correspond to the last written
values, if any, or the contents in the initial state s otherwise, and if s' results
from s by updating register and memory contents with the values in t.
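
A minimal Python sketch of this execution relation, assuming a state is a plain dictionary from locations (register names or addresses) to values and events are hypothetical `(kind, location, value)` tuples:

```python
def execute(state, trace):
    """Return the final state s' if `trace` can be executed from `state`,
    or None if some read event disagrees with the current contents."""
    cur = dict(state)                  # do not mutate the initial state
    for kind, loc, val in trace:
        if kind == "read":
            # A read must observe the last written value, if any,
            # or otherwise the contents of the initial state.
            if cur.get(loc) != val:
                return None
        elif kind == "write":
            cur[loc] = val
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return cur


s = {"x0": 1, "x1": 7}
ok = [("read", "x0", 1), ("write", "x1", 1), ("read", "x1", 1)]
bad = [("write", "x1", 1), ("read", "x1", 7)]   # reads a stale value
```

Here `execute(s, ok)` yields the updated state, while `execute(s, bad)` yields `None` because the read of `x1` does not see the preceding write.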

Proving the instruction-local properties of the last subsection for a concrete 
ISA might also require certain architecture-specific assumptions. We allow the 
specification of both a capability invariant that is preserved by capability deriva- 
tion and assumed to hold initially, and a predicate on traces capturing further 
assumptions, e.g. about system registers. We say that an architecture is a CHERI
ISA if all possible traces of instruction execution and fetching that satisfy the 
architecture-specific trace assumptions, and that read only capabilities satisfy- 
ing the architecture-specific capability invariants, satisfy the properties of §4.4. 
Reachable capability monotonicity then holds for executions of arbitrary se- 
quences of instructions, unless and until a transition to another security domain 
occurs via an ISA exception or sealed capability invocation. 


Theorem 1 (Reachable Capability Monotonicity). Let $t = tf_1 \cdot te_1 \cdot tf_2 \cdot te_2 \cdots$
be a trace of the fetch-decode-execute loop of a CHERI ISA, alternating
fetch/decode traces $tf_i$ and instruction execution traces $te_i$, and let s be a state
such that $s \xrightarrow{t} s'$. If all of the following hold:


1. all traces $tf_i$ and $te_i$ satisfy the architecture-specific assumptions,

2. the capabilities in s satisfy the architecture-specific capability invariants,

3. none of the fetch and execute traces $tf_i$ and $te_i$ raise an ISA exception,

4. the address translation mapping stays invariant along t, and

5. unsealed versions of the invoked sealed capabilities in t are reachable in s,


then the set of capabilities reachable in s' is a subset of the capabilities reachable in s.


This guarantees that software cannot escalate its privileges by forging capa- 
bilities that are not reachable from the starting state. Non-monotonic changes 
in the set of reachable capabilities are limited to the specific mechanisms defined 
above for transferring control to another security domain, i.e. ISA exceptions 
or sealed capability invocations, installing capabilities belonging to the new do- 
main in the PCC (and possibly IDC) register. The monotonicity guarantee stops 
before such a domain transition happens. Sealed capability invocations within 
a security domain are monotonic, however; the theorem does cover capability 
invocation instructions, e.g. branch instructions taking sentry capabilities, if the 
unsealed invoked capability is reachable in the current security domain (con- 
dition 5 above). The translation invariance assumption (condition 4) rules out 
non-monotonicity due to the interpretation of capabilities changing when the 
memory mapping changes. It is assumed to hold for the duration of the given 
intra-domain trace, but after a domain transition and return, e.g. a system call, 
one could continue using this theorem with a modified translation mapping. 

The proof of Theorem 1 starts with an induction on the number of
instructions in the trace. For each individual subtrace t of an instruction fetch or
execution with $s \xrightarrow{t} s'$, we show that the available capabilities at any point in t are
reachable in s, as the definition of available capabilities excludes non-monotonic
cases and only includes capabilities that are accessed with suitable permission
due to the properties we require. Hence, state updates along t leading to s' are
monotonic: by the requirements and assumptions, they only write capabilities
that are available, or invoked but reachable.


5 Proof of Capability Monotonicity in Morello 


5.1 Instantiation of the Abstract Model 


In order to instantiate Theorem 1 for Morello, we instantiate the parameters of 
the abstract model, e.g. the set of privileged registers or the concrete capability 
representation. We do not currently instantiate the address translation mapping, 
effectively treating address translation as a black box and assuming an arbitrary 
but fixed partial mapping, together with a predicate on events to capture as- 
sumptions on register and memory contents, under which the mapping produced 
by the ASL address translation code is guaranteed to coincide with the given 
mapping. A candidate for instantiating this is the purely functional character- 
isation of address translation presented in [9, §8] and proved correct there for 
the base Armv8.3 architecture, under some assumptions about control registers. 
Using this would also allow (and require) us to substantiate the translation in- 
variance assumption of Theorem 1. In particular, since the translation control 
registers are protected by the system register access permission, code running 
without that permission and without write access to the in-memory translation 
tables cannot modify the translation mapping. 

For the monotonicity proof, the main architecture-specific assumption we 
make is that two privileged system features that could be used to violate mono- 
tonicity are inactive: external debuggers, and the experimental instructions SCTAG 
and STCT that allow setting tags of arbitrary capability bit patterns. Hence, we 
make assumptions on the contents of certain control registers to disable these 
(e.g. EDSCR.STATUS = 2 to model non-debug state); the tag setting instructions 
can also be disabled by removing the system access permission. 

The capability invariant that we assume in the initial state is that bounds 
do not go beyond the 64-bit address space and that their length is non-negative, 
e.g. to rule out memory accesses that wrap around the edge of the address 
space. There exist capability encodings that violate this property, but the only 
way to generate them on Morello is via the tag setting instructions or an external 
debugger, which we assume to be disabled. 

We also assume that the PCC capability is initially unsealed, if it is tagged, 
which the ASL code relies on in a few places. We proved this as an invariant 
after a bug we found in a branching helper function (see §5.4) was fixed. 

Finally, we have to limit certain kinds of “constrained unpredictable”
behaviour. For example, the LDP instruction loads a pair of words into two
destination registers. However, if the same register index is used for both destination
register arguments to the instruction, then it is left underspecified what value is 
written to the destination register, if any. One might expect this to be either the 
original register value or one of the loaded values, but Morello inherits from the 
base Armv8-A architecture the specification that the register value may be set to 
an architecturally UNKNOWN value in such cases. For capabilities, the Morello spec- 
ification [8] further constrains this in rule TSNJF: “If an UNKNOWN value is written 
to a capability register or to capability-tagged memory, the write does not in- 
crease the Capability defined rights available to software.” We formalise this by 
adding an assumption that, in traces for which we want to use the monotonicity 
theorem, all UNKNOWN capabilities used (appearing in traces in nondeterministic 
choice events) are reachable from the initial state of the trace. 


5.2 Manual Proofs about Capability Encoding Functions 


We have to prove that the various functions that make changes to the concrete 
129-bit capability representation (as used by the instruction semantics) do so in 
a monotonic way. The challenging aspect is the compressed capability bounds 
encoding introduced in [58] and used by Morello (as opposed to the version of 
CHERI-MIPS targeted by previous verification work [38], which used a simpler, 
uncompressed 256+1-bit encoding). The compression scheme allows the capa- 
bility address value and both bounds, three 64-bit values, to be encoded in less 
than 128 bits. This exploits the fact that in well-behaved code the address should 
be within the bounds or nearby, so the bounds can be expressed as smaller off- 
sets from it. They are encoded in a floating-point style, with an exponent and a 
floating “mantissa” window. Typical smaller capabilities have precise bounds, but 
large capabilities require aligned bounds, to save encoding space; the encoding 
uses various optimisations to maximise precision [58], [8, §2.5.1]. 

We initially SMT-checked the encoding functions using Sail’s existing SMT 
backend. This provided early design feedback, including discovering an issue in 
the CapSetBounds function (see §5.4). 

When moving from SMT checks to Isabelle proofs that can be integrated 
into the overall proof, one challenging function is CapIsRepresentableFast, which 
checks that an update to the capability value by an offset does not change 
the decoding of the bounds. It is important for performance that this check is 
done quickly. This fast version only considers the offset arithmetic within the 
mantissa window, making pessimistic assumptions about overflow/underflow in 
lower bits. We can prove that this check is sufficient, using algebraic methods in 
Isabelle/HOL without bit-blasting or SMT proofs. 

The most challenging function for us to verify is CapSetBounds, which is
used to narrow capability bounds. The function checks that the requested new
bounds fit monotonically in the existing bounds. It also picks an appropriate 
exponent, aligns to that exponent, and encodes an updated capability. 

The main complication is that aligning the bounds to an exponent changes 
the length slightly, which may be an increase that requires a higher exponent. 

The core argument for monotonicity here is non-trivial: the chosen alignment
is the minimum one for which bounds can be encoded which enclose the requested 
bounds. Since the original capability also enclosed this range, its alignment can- 
not be less than this minimum, thus the bounds of the original capability are 
already aligned to the selected exponent. This finally implies that coercing the 
requested bounds to the selected exponent does not move them across the original
bounds. Part of the proof of this lemma involves a brute-force case split over
all possible selected exponents, reducing each case to SMT bitvector lemmas
that we pass to the CVC4 SMT solver [11]. This relies on the solver
as an oracle, as replay of bitvector proofs in Isabelle is only experimental. Initial
work on the CHERI compression scheme [58] included HOL4 proofs about these 
two functions, but this is the first time the crucial monotonicity proof has been 
done for the set-bounds function. 
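
The alignment argument can be tried out on a toy model. The Python sketch below is not Morello's encoding or the CapSetBounds function: it assumes a simplified scheme in which bounds are rounded outwards to multiples of 2^e for the smallest exponent e at which the length fits in `mw` mantissa bits, and it checks monotonicity exhaustively over a small address range.

```python
def round_bounds(base, top, mw):
    """Round [base, top) outwards, using the smallest exponent e for which the
    rounded length, counted in units of 2**e, fits in mw bits."""
    e = 0
    while True:
        lo = (base >> e) << e             # round base down to a multiple of 2**e
        hi = -((-top) >> e) << e          # round top up to a multiple of 2**e
        if (hi - lo) >> e < (1 << mw):
            return e, lo, hi
        e += 1


def representable(lo, hi, mw):
    """Bounds are representable if rounding leaves them unchanged."""
    return round_bounds(lo, hi, mw)[1:] == (lo, hi)


def check_monotonic(n, mw):
    """For every representable original [olo, ohi) within 0..n and every
    requested [b, t) inside it, the rounded bounds must stay inside the
    original. Returns a counterexample tuple, or None if there is none."""
    for olo in range(n + 1):
        for ohi in range(olo, n + 1):
            if not representable(olo, ohi, mw):
                continue
            for b in range(olo, ohi + 1):
                for t in range(b, ohi + 1):
                    _, lo, hi = round_bounds(b, t, mw)
                    if not (olo <= lo and hi <= ohi):
                        return (olo, ohi, b, t)
    return None


example = round_bounds(0x1003, 0x2001, 6)   # exponent, rounded base, rounded top
counterexample = check_monotonic(32, 2)
```

The `example` call illustrates the complication with alignment: the requested length of 4094 bytes is just under 64 units of 2^6, but rounding the bounds outwards at exponent 6 yields 65 units, so exponent 7 is selected. The exhaustive check finds no counterexample in this toy model as long as the original bounds are themselves representable; for a non-representable range such as [1, 14) with `mw = 2`, rounding does widen the bounds.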


5.3 Proof Engineering 


With the model instantiation and lemmas about auxiliary functions in place, the 
remaining task is to prove that the rest of the ISA uses these functions correctly 
and satisfies the properties defined in §4.4. We tackle this using a combination 
of custom proof tactics within Isabelle and an external tool that automatically 
generates lemmas about the functions and instructions in the architecture. This 
simple approach worked sufficiently well that we were able to keep up with weekly 
snapshots of the ASL specification while it was being developed. Re-running the 
lemma generation tool mostly worked without affecting the existing manually 
written parts of the proof, with only a few exceptions, e.g. when a refactoring of
the (crucial) VACheckAddress function broke some lemmas about it. 

The generated lemmas are stated in terms of predicates that reformulate the 
properties of §4.4 into properties of partial traces, taking an additional param- 
eter that summarises the capabilities available at the start of this part of the 
trace. This allows us to split up an instruction proof into proofs that the auxiliary 
functions satisfy the properties and that they are used correctly, e.g. that a func- 
tion performing a memory store is only called if a suitable authorising capability 
is available. Most of these proofs are automatically handled by straightforward 
proof tactics, but our tooling allows manually overriding specific parts of gener- 
ated lemmas where necessary. We do this for about 100 of the ASL functions and 
instructions, generally taking the form of small patches, e.g. giving additional 
hints to the proof tactics, such as additional simplification rules or loop invari- 
ants, or adding side conditions to lemma statements, such as assumptions about 
capability checks for memory-accessing helper functions. The tool outputs the 
generated lemmas in theory files which are then checked by Isabelle; hence, the 
external tool does not need to be trusted. The proof consists of around 37 000 
generated lines, 8600 manually written lines, as well as 8900 lines for the ab- 
stract model, monotonicity proof, and proof tools. The proof executes in 7hrs 
20mins CPU time on an i7-10510U CPU at 1.80GHz, but only 3hrs 23mins real 
time thanks to parallel execution, with peak memory consumption of 18GB. 


5.4 Bugs and Issues Found


Our verification work uncovered several bugs and issues in the ASL specification. 

During our initial SMT-checking of the capability manipulation helper func- 
tions, one issue we discovered that was not known previously was a bug in the 
top-byte normalisation logic of the CapSetBounds function, which could have led 
to some of the top bits of the lower or upper bound of a capability changing when 
modifying some of their lower bits, even if the requested bounds were within the 
original bounds of the input capability, thereby violating monotonicity. 

Our Isabelle proof uncovered a bug in the BranchToCapability function where 
the branch target capability was modified without a check that it is unsealed. 
Hence, branch instructions could have modified sealed capabilities. The result 
would not have been directly available to the code that performed the branch, 
because the modified sealed capability would be installed into PCC, and the 
subsequent instruction fetch would fault with a sealed capability exception, but 
as part of exception handling the modified sealed capability would then have been 
written to the CELR register and become accessible to the exception handler. 

Another issue we found was a case of missing capability checks in the im- 
plementation of the DC ZVA instruction. This would have allowed software to 
overwrite memory regions with zeros without capability authorisation. 

We also found various issues that were already known to Arm, e.g. the STP 
instruction checking the tag of the wrong capability, as well as functional bugs 
not directly affecting our proof of security properties, e.g. a bug in the LDNP and 
STNP instructions where the wrong memory access type was used. 

We reported all of our findings to Arm, and the issues have been fixed. 


6 Validating the Concrete Semantics 


Confidence in our results about Morello’s security properties relies on our trans- 
lation of the specification (from ASL into Sail and Isabelle) accurately reflecting 
the intended architecture. A key part of ensuring that hardware designs imple- 
ment Arm architectures correctly is to test against Arm’s internal Architectural 
Compliance Kit (ACK); to validate our translation we ran a large collection of 
tests from the Morello ACK against a Sail-generated C emulator. This approach
was also taken with an earlier AArch64 Sail model [9]. These tests are typically 
self-contained executables that can be run directly after processor reset without 
an operating system or peripherals, except for a simple serial device for reporting 
results and diagnostic information. Each test executes tens or even hundreds of 
thousands of instructions, so using our fast C emulator was essential. 

The ACK covers Morello-specific functionality alongside the relevant parts of 
the base Armv8.2-A architecture in more than 25 000 tests. Its scope is wider than
the ASL model, including features such as performance counters, debug, and 
tracing, where the ASL has only interfaces or partial information, leaving the 
detailed specification to prose descriptions. There are also tests for the generic 
interrupt controller (GIC), a distinct system-on-chip component with a separate
specification which is not part of the ISA. Moreover, for the Morello-feature
suites, the “implementation defined” behaviour expected by the tests is more 
constrained than normal to match the single Morello hardware design. 

To manage this complexity we first obtained baseline results from a Morello 
Arm Fast Model simulator, without the additional support normally used in the 
ACK testing environment. This matches the contents of the ASL specification 
more closely. We then excluded tests which required features that are not fully 
modelled, and adjusted the “implementation defined” portions of the specification 
to approximate the hardware. By comparing the results from our Sail-generated
emulator against the baseline we could identify and repair faults in both the ASL 
specification and our translation. Repairing these issues was important both to 
ensure that our understanding of the problem was correct and to ensure that 
tests could run to completion to rule out further issues. 

Specific issues that we encountered involved minutiae about how system reg- 
ister bits behave when features are not present (such as AArch32 instructions), 
a couple of missing cases in our built-in operations used by SIMD instructions, a 
variable shadowing issue in our translation tools, corner cases in the ASL speci- 
fication handling of page table capability tracking, and a few exception handling 
problems. None of these issues affect capability monotonicity. 

The resulting pass rate was 98.1% compared with the baseline. The discrepan- 
cies were mostly due to limitations of the ASL model, such as limited debugging 
support, corner cases in address space handling, and the lack of secure memory; 
a few details with some SIMD instructions and particular processor exceptions 
require further investigation, but again, they do not affect monotonicity. 


7 Model-based Test Generation 


In addition to the ACK, and before we had access to it, we generated a test 
suite from the model to check core instruction and capability functions against 
the implementations, and also to adapt QEMU to support most of Morello. We
use symbolic execution, well-established as a way to generate high coverage test 
suites [12,43] and used previously for a much simpler CHERI architecture [13], 
both to perturb the initial state to explore different instruction behaviours and 
to control whether processor exceptions are taken. The latter is particularly 
useful for CHERI ISAs because most input values would trivially fault at one 
of the capability checks (e.g. see CheckCapability in Fig. 2). Instruction set 
specifications are good candidates for symbolic execution because the languages 
tend to be relatively simple and the number of paths for any given instruction 
is bounded. To build a test generator for Morello we were able to reuse the Isla 
symbolic execution tool, which was already being developed for work combining 
Sail ISAs with relaxed memory models [10]. 

The test generator operates on traces of instructions, partially or fully chosen 
at random from the encoding diagrams included in the original ASL. Isla’s sym- 
bolic execution was extended with a simple sequential memory model using SMT 
arrays for the main memory and tags. In outline, the generator: 1. initialises the
model by running the processor reset function in the symbolic executor (this
is deterministic and does not involve any symbolic state); 2. alters the state so
that the parts the test harness can change are symbolic, and fixes other values as
necessary (e.g., for memory translation); 3. symbolically executes each instruc- 
tion in turn to find feasible behaviours and pick one; 4. passes the accumulated 
path conditions to the Z3 SMT solver [16] to find suitable concrete values for the 
initial and final states; and 5. constructs the final test with the instructions and 
the test harness which will set up the initial state and check the final state after 
execution. This harness is hand-written (although automatically producing it in 
the style of Martignoni et al. [29] would be interesting to explore), so to accel- 
erate development we first restricted our attention to fault-free behaviours with 
memory management turned off, then gradually added support for exceptions, 
for a simple fixed memory mapping, and checks of more of the processor state 
after execution. 
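
Steps 3 to 5 can be pictured with a deliberately tiny OCaml sketch. It is not Isla: the `state` record, the two-path bounds-checked load, and all function names are hypothetical, and an exhaustive search over a small domain stands in for the Z3 query. Only the overall shape follows the outline above: each feasible behaviour carries a path condition, and a solver is asked for a concrete initial state that satisfies the chosen one.

```ocaml
(* Hypothetical machine state for one toy instruction: a bounds-checked load. *)
type state = { addr : int; base : int; limit : int }

type outcome = Loads | Faults

(* Path condition of the non-faulting behaviour. *)
let in_bounds s = s.base <= s.addr && s.addr < s.limit

(* Each feasible behaviour is paired with its path condition. *)
let paths : (outcome * (state -> bool)) list =
  [ (Loads, in_bounds); (Faults, fun s -> not (in_bounds s)) ]

(* All states whose fields lie in [0, n): a stand-in for the symbolic state. *)
let all_states n =
  let range = List.init n Fun.id in
  List.concat_map
    (fun addr ->
      List.concat_map
        (fun base -> List.map (fun limit -> { addr; base; limit }) range)
        range)
    range

(* "Solver": return the first state satisfying the chosen path condition.
   A real generator would hand the condition to an SMT solver instead. *)
let witness_for wanted = List.find_opt (List.assoc wanted paths) (all_states 8)

(* Number of toy states that reach the non-faulting path. *)
let loading_states = List.length (List.filter in_bounds (all_states 8))
```

In this toy domain only 84 of the 512 states avoid the fault, which hints at why choosing the path first and solving for inputs afterwards is more productive than sampling inputs at random.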


Our coverage goal for test generation was to ensure that all of the specifi- 
cation code for manipulating capabilities and for instructions that were added 
or modified for Morello would be executed in some test. This was complicated 
by non-determinism in parts of the specification. Some instructions have “con- 
strained unpredictable” forms which can have one of several effects; e.g., a load- 
pair where both destination registers are the same might write UNKNOWN to them, 
do nothing, or take a fault. In principle allowing for all of these is possible, but 
the resulting disjunctions are likely to be much more difficult to solve, and the 
behaviours themselves are not very interesting, so we discarded these paths. 

Another area of non-determinism in the specification is the load/store ex- 
clusive instructions that are used for synchronisation. Even during single-core 
execution these instructions behave non-deterministically because of the particular
memory architecture choices, which are left as unimplemented primitive operations
in the specification. To test these instructions we added a simple model of the 
guaranteed behaviour in Sail, which includes assertions to avoid uncertain cases. 


While the number of paths to explore in any instruction is bounded, the num- 
ber of paths found for some instructions remains impractically large. The main 
cause is the case splits in the capability compression scheme. We reduce these 
to a single path by pushing the decisions into the SMT solver using Isla’s lin- 
earisation feature, extended to support more of the language, which transforms 
functions with no side effects into a single SMT expression. This was sufficient 
to perform large-scale test generation with the Morello model. 
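
The effect of linearisation on path counts can be sketched in OCaml with a toy expression type. This is an illustration only, not Isla's implementation: the `expr` type, `count_paths`, and `splits` are hypothetical names. Exploring each conditional separately multiplies paths, whereas keeping the conditional as an if-then-else term yields one expression whose case analysis is left to the solver.

```ocaml
(* Toy expression language with a conditional; all names are illustrative. *)
type expr =
  | Var of string
  | Const of int
  | Add of expr * expr
  | Ite of expr * expr * expr (* if c <> 0 then a else b *)

(* Path-by-path exploration: every Ite forks, so path counts multiply. *)
let rec count_paths = function
  | Var _ | Const _ -> 1
  | Add (a, b) -> count_paths a * count_paths b
  | Ite (c, a, b) -> count_paths c * (count_paths a + count_paths b)

(* Linearised view: the whole expression stays a single term. Its meaning is
   given by ordinary evaluation, with the Ite decided by the condition value. *)
let rec eval env = function
  | Var x -> List.assoc x env
  | Const n -> n
  | Add (a, b) -> eval env a + eval env b
  | Ite (c, a, b) -> if eval env c <> 0 then eval env a else eval env b

(* k independent case splits added together: 2^k paths, but one term. *)
let rec splits k =
  if k = 0 then Const 0
  else Add (Ite (Var ("c" ^ string_of_int k), Const 1, Const 2), splits (k - 1))
```

For example, `count_paths (splits 10)` is 1024, while the linearised form of the same computation is the single term `splits 10`.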


We checked our progress against our coverage goal using the Sail C back- 
end’s coverage measurement support, counting, for each expression in a Sail 
specification, the number of tests that exercise it. Once we had enough tests 
that the accumulated coverage began to level out, it was apparent that certain 
instructions and corner cases were not exercised enough. Overriding the ran- 
dom instruction choice filled in most of the gaps, and temporarily disabling the 
linearisation allowed exhaustive testing of a key capability function. 
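
The bookkeeping behind such a check is simple enough to sketch in OCaml. The example below does not reproduce the Sail C backend's coverage format: expression identifiers are plain integers and each test is represented only by the set of identifiers it executed, both assumptions made for illustration.

```ocaml
module IntSet = Set.Make (Int)

(* For each expression id, count the tests whose executed set contains it. *)
let count_tests_per_expr (all_exprs : int list) (runs : IntSet.t list) =
  List.map
    (fun e -> (e, List.length (List.filter (IntSet.mem e) runs)))
    all_exprs

(* Expressions that no test exercises: candidates for targeted generation. *)
let unexercised counts =
  List.filter_map (fun (e, n) -> if n = 0 then Some e else None) counts

(* Number of distinct expressions covered after each successive test.
   A flat tail suggests that random generation has levelled out. *)
let cumulative (runs : IntSet.t list) =
  let _, acc =
    List.fold_left
      (fun (seen, acc) r ->
        let seen = IntSet.union seen r in
        (seen, IntSet.cardinal seen :: acc))
      (IntSet.empty, []) runs
  in
  List.rev acc
```

The output of `unexercised` corresponds to the gaps that were then filled by overriding the random instruction choice.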

The tests found a few minor issues in our tooling and some more bugs in
the original ASL specification: several undefined variants of instructions were
included, a new load-pair should have been marked “constrained unpredictable”,
a set-bounds operation could read the wrong register, and a translation
fault could be missed in a load-tags instruction. Corrections were made to the
specification for these issues; a couple also arose in one of the implementations
of Morello, and these were then fixed.

Comparing the coverage of these tests with the ACK is instructive. As we 
used the Sail coverage as a goal, we hit a few gaps in the ACK, such as the 
set-bounds issue, and a rare corner case in a core capability function. However, 
the ACK’s coverage goals included semantic notions that we cannot capture 
easily. For example, if a conditional is supposed to be false because the first 
of three checks will fail, human-authored coverage includes the other checks 
passing, whereas our generator does not reason about the other checks because 
the symbolic execution does not reach them. 
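
The difference can be made concrete with a small OCaml sketch. The three check names are hypothetical and the code only models short-circuit evaluation of a conjunction. It records which checks are actually evaluated, which is all that a path-based generator observes.

```ocaml
(* Names of the checks evaluated during the current run, most recent first. *)
let reached : string list ref = ref []

let check name result =
  reached := name :: !reached;
  result

(* A conditional guarded by three checks, evaluated left to right with
   short-circuiting: later checks are skipped once one of them fails. *)
let allowed ~tagged ~in_bounds ~has_perm =
  check "tag" tagged && check "bounds" in_bounds && check "perm" has_perm

(* Run the conditional and report its result with the checks it evaluated. *)
let trace ~tagged ~in_bounds ~has_perm =
  reached := [];
  let r = allowed ~tagged ~in_bounds ~has_perm in
  (r, List.rev !reached)
```

When `tagged` is false only the first check is recorded, so the path condition leaves `in_bounds` and `has_perm` unconstrained and a solver may pick any values for them. A human-authored test would instead fix both to true, so that the failure is attributable to the first check alone.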

The generated test suite was also used as the basis for test-driven develop- 
ment of an extension of QEMU’s Armv8-A support to Morello. After adding 
basics, such as tagged memory and the expanded register file, the tests guided 
which features to implement, easing development. Small errors were picked up 
automatically, such as confusing the stack pointer and zero registers (which share 
an encoding) and sign extension bugs, including one in the pre-existing QEMU 
code where a previous attempt to fix it had missed a subtle issue. 
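
Both classes of error are easy to state concretely. The OCaml sketch below is not QEMU code: `sign_extend` and `decode_reg` are hypothetical helpers written for illustration. In A64 encodings, register number 31 denotes either the stack pointer or the zero register depending on the instruction, so a decoder has to be told which reading applies.

```ocaml
(* Sign-extend the low [bits] bits of [x] to an OCaml int. *)
let sign_extend ~bits x =
  let mask = (1 lsl bits) - 1 in
  let v = x land mask in
  if v land (1 lsl (bits - 1)) <> 0 then v - (1 lsl bits) else v

type reg = X of int | SP | ZR

(* Register number 31 is context-dependent; the caller supplies the context. *)
let decode_reg ~sp_allowed n =
  if n < 0 || n > 31 then invalid_arg "decode_reg"
  else if n = 31 then if sp_allowed then SP else ZR
  else X n
```

A generated test that writes through register 31 in an instruction where it means SP, and then checks the final state, fails immediately if the emulator treats it as the zero register.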

The adapted QEMU now boots CheriBSD, a version of FreeBSD with capa- 
bility support, although this required some fixes for issues that were not found 
by the generated test suite. A few involved parts of the state that were not 
explicitly included in the self-test, particularly around exception handling, but 
most of them concerned out-of-scope system features. 


8 Related Work 


Nienhuis et al. [38] proved similar results for the CHERI-MIPS architecture, 
above the Isabelle generated from L3 [23]. CHERI-MIPS is much smaller than 
Morello (6k LoS), and much simpler, without page tables, virtualisation, vector 
instructions, etc. They identified 9 properties of the ISA semantics that sufficed 
to show reachable capability monotonicity and a secure encapsulation result. 
These captured the capability-relevant intentions of instructions explicitly, but 
were expressed in terms of a conventional whole-system semantics, instead of 
the intra-instruction semantics we use here, and that was key to scaling. Each 
instruction had to be annotated with its intention, extensive work was needed 
to prove commutativity results, and the properties were MIPS-specific. 

The other most closely related work, proving properties of capability archi- 
tectures, establishes stronger results but for highly idealised architecture defi- 
nitions. While our monotonicity theorem is about arbitrary machine execution 
up to a domain crossing, Skorstengaard et al. and Georges et al. [46,47,49,48,24] 
establish logical-relation methods for reasoning about combinations of arbitrary 
and known code, the latter mechanised in Iris [28], but for idealised machines 
rather than full architectures. These add new features to help enforce strong
properties, but with unclear hardware implementation cost. Strydonck et al. [50]
and El-Korashy et al. [19] study secure compilation in similarly idealised settings. 
Ultimately one would like to scale all these methods to production CHERI archi- 
tectures. de Amorim et al. [5,4] verify information-flow properties of their SAFE 
architecture, also for a simplified model. 

Capabilities have also been used in the interfaces of numerous operating sys- 
tems. PSOS [37] uses a similar hardware tag bit to CHERI, but all capability 
operations are implemented in the OS rather than hardware. Various other
operating systems use standard hardware but have capabilities as part of their
interfaces. These systems are very different to CHERI, but their security models have
many similarities. Proofs that a (simplified) OS interface matches an abstract 
capability security model have been done for the EROS OS [45] and for the seL4 
kernel [20]. A subsequent proof connects to the seL4 implementation [44]. Each 
of these abstract models somewhat resembles ours, e.g. with notions of reachable 
and derivable capabilities. Our observation that domain-crossing events create 
extra complications also seems to apply to seL4. 

There is a great deal of work devoted to other approaches to improve mem- 
ory safety which we cannot detail here, but see the review [51]. For just a sam- 
ple, many projects have developed software-implemented variants of C or C++ 
that provide greater safety, but typically with rather different performance and 
code-porting costs to CHERI, and without considering whole-system aspects 
outside a single C/C++ program [25,36,34,35,17,42,21]. Then there are many 
hardware-accelerated approaches, e.g. MPX, WatchdogLite, Watchdog, and
Hardbound [33,32,31,18]. A different line of work aims at bug-finding rather than 
deterministic mitigation, e.g. AddressSanitizer [2] and many others. 

If widely adopted, Morello would radically change the landscape for such 
work, and for computer security more generally.

Abstract. CompCert is the first realistic formally verified compiler: it
provides a machine-checked mathematical proof that the code it gener- 
ates matches the source code. Yet, there could be loopholes in this ap- 
proach. We comprehensively analyze aspects of CompCert where errors 
could lead to incorrect code being generated. Possible issues range from 
the modeling of the source and the target languages to some techniques 
used to call external algorithms from within the compiler. 


Keywords: Formally Verified Software - The Coq Proof Assistant 


1 Introduction 


CompCert [35,34,36] is a formally verified compiler for a large subset of the C99 
language (extended with some C11 features): there is a proof, checked by a proof 
assistant, that if the compiler succeeded in compiling a C program and that 
program executes with no undefined behavior, then the assembly code produced 
executes correctly with the same observable behavior. Yet, this impressive claim 
comes with some caveats; in fact, there have been bugs in CompCert, some of 
which could result in incorrect code being produced without warning [57]. How 
is this possible? 

The question of the Trusted Computing Base (TCB) of CompCert has been 
alluded to in general overviews of CompCert [37,27], but there has been so far 
no detailed technical discussion of that topic. While our discussion will focus 
on CompCert and Coq, we expect that much of the general ideas and insights 
will apply to similar projects and other proof assistants: other verified compilers, 
verified static analysis tools, verified solvers, etc. 

We analyze the TCB of the official releases of CompCert,¹ and two forks:
CompCert-KVX,² adding various optimizations and a backend for the Kalray KVX
VLIW (very long instruction word) core, and CompCert-SSA,³ adding
optimizations based on static single assignment (SSA) form [6,18]. Versions and changes


* A software artefact is available from https://doi.org/10.5281/zenodo.5913981 
1 https://github.com/AbsInt/CompCert

2 https://gricad-gitlab.univ-grenoble-alpes.fr/certicompil/compcert-kvx

3 https://gitlab.inria.fr/compcertssa/compcertssa 


© The Author(s) 2022
to these software packages are referred to by git commit hashes. We discuss
alternate solutions, some of which are already implemented in other projects, their
applicability to CompCert, as well as related work.

Sections 2 and 3 analyze the TCB part coming from Coq usage. Section 4 
presents the TCB part connecting the Coq specification of CompCert’s inputs 
(source code) to the user view of these inputs. Sections 5 and 6 analyze the TCB 
part connecting the Coq specification of CompCert’s generated programs to the 
actual platform running these programs. The conclusion (Section 7) summarizes which
TCB parts of CompCert (and its forks) are the most error-prone, and discusses 
possible improvements. 


2 The Coq Proof Assistant 


CompCert is mostly implemented in Coq,⁴ an interactive proof assistant [2]. Coq
is based on a strict functional programming language, Gallina, based on the
Calculus of Inductive Constructions, a higher-order λ-calculus. This language
allows writing executable programs, theorem statements about these programs, 
and proofs of these theorems. CompCert is not directly executed within Coq. In- 
stead, the Coq code is extracted to OCaml code, then linked with some manually 
written OCaml code. We now discuss how issues in the Coq implementation may 
impact the correctness of CompCert. 


2.1 Issues in Coq Proof Checking 


Proofs written directly in Gallina would be extremely tedious and unmaintain- 
able, so proofs are usually built using Coq tactics. While some other proof as- 
sistants trust tactics to apply only correct logical steps, this is not the case with 
Coq: what the tactics build is a λ-term, which could have been typed directly in
Gallina if not for the tedium, and this λ-term is checked to be correctly typed
by the Coq kernel. This allows tactics to be implemented in arbitrary ways, 
including calling external tools, without increasing the TCB. 

A theorem statement is proved when a λ-term is shown to have the type
of that statement (the Curry-Howard correspondence thus identifies statements
and types, and proofs and λ-terms). Thus, all logical reasoning in Coq relies on
the correctness of the Coq kernel, and some driver routines. In addition to the 
Coq compiler coqc and Coq toplevel coqtop, a proof checker coqchk provides 
some level of independent checking. 

Coq is a mature development; however, “on average, one critical bug has
been found every year in Coq” [51]. Let us comment on the official list of these
bugs.⁵ Interestingly, the list classifies their risk according to whether they can be
exploited by accident. We can probably assume that the designers of CompCert 
would not deliberately write code meant to trigger a specific bug in Coq and 


4 https://coq.inria.fr/
5 https://github.com/coq/coq/blob/master/dev/doc/critical-bugs

prove false facts about compiled code: exploiting a Coq bug by mistake in a
way sufficiently innocuous to evade inspection of the source code, to accept an 
incorrect optimization that would be triggered only in very specific cases (to 
evade being found through testing), seems highly unlikely. 

Proofs are checked by Coq’s kernel, which is essentially a type-checker for 
the λ-calculus implemented by Coq (the Calculus of Inductive Constructions
with universes). There have been a number of critical bugs involving Coq’s ker- 
nel, particularly the checking of the guard conditions (whether some inductively 
defined function truly performs structural induction) and of the universe condi- 
tions (Coq has a countable infinity of type universes, all syntactically called Type, 
distinguished by arithmetic constraints, which must then be checked for valid- 
ity). These conditions prevent building some terms having paradoxical types. 
Furthermore, there are options (in the source code or the command-line) that 
disable checking guard, universe or positivity conditions. For instance, if one dis- 
ables the guard condition to build a nonterminating function as though it were 
a terminating one, it is possible to prove “false”: 


Unset Guard Checking. 
Fixpoint loop {A: Type} (n : nat) {struct n}: A := loop n. 
Lemma false: False. Proof. apply loop. exact 0. Qed. 


coqchk -o lists which guard conditions have been disabled—none in CompCert. 

The Coq kernel can evaluate terms (reduce them to a normal form), but is 
rather slow in doing so. For faster evaluation, it has been extended with a virtual 
machine (vm_compute) [24] and a native evaluator (native_compute) [10]. Both 
are complex machinery, and a number of critical bugs have been found in them.⁶
In CompCert, there are a few direct calls to vm_compute, none to native_compute;
but there may be indirect calls through tactics calling these evaluators. 


2.2 Issues in Coq Extraction 


Coq’s extractor, as used in CompCert, produces OCaml code from Coq code, 
which is then compiled and linked together with some other OCaml code. Ex- 
traction [39,38], roughly speaking, corresponds to removing non-computational 
(proof) content, compensating for some typing issues (see below), renaming some 
identifiers (due to different reserved words), and of course printing out the result. 
Coq’s extractor and OCaml are in the TCB of CompCert. 

OCaml’s type safety ensures that, barring the use of certain features that 
circumvent this type safety (unsafe array accesses, marshaling, calls to external 
C functions, the Obj module allowing unsafe low-level memory accesses...), no
type mismatch or memory corruption can happen at runtime within that OCaml 
code. None of these features are used within CompCert, except for calling C 


6 For instance, there used to be a bug with respect to types with more than 255 con- 
structors that allowed proving “false” https://github.com/clarus/falso, so ludicrous 
that it made it into a satirical site https://inutile.club/estatis/falso/.

functions implementing the OCaml standard library, and some calls to Obj.magic,
a universal unsafe cast operator, produced by Coq’s extractor.

Calls to Obj.magic are used by the extractor to force OCaml to accept con- 
structs (dependent types, arbitrary type polymorphism) that are correctly typed 
inside Coq but that, when mapped to OCaml types, result in ill-typed programs. 
The following program is correct in Coq (or in System F) but cannot be typed 
within OCaml’s Hindley-Milner style of polymorphism, so its extraction uses Obj.magic:⁷


Definition m (g : forall {T}, list T -> list T) : list bool * list nat :=
  ((g (false :: nil)), (g (O :: nil))). Extraction m.
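
For comparison, the same function can be written by hand in OCaml without Obj.magic by wrapping the polymorphic argument in a record with an explicitly polymorphic field, the feature mentioned in footnote 7. The sketch below is hand-written, not extractor output; the record type `list_fun` is a hypothetical name, and OCaml's `int` stands in for Coq's `nat`.

```ocaml
(* A record field may carry an explicitly polymorphic type, which lets the
   wrapped function be applied at two different element types. *)
type list_fun = { g : 'a. 'a list -> 'a list }

let m (f : list_fun) : bool list * int list = (f.g [ false ], f.g [ 0 ])

(* Without the wrapper, a lambda-bound [g] is monomorphic, so the definition
     let m' g = (g [ false ], g [ 0 ])
   is rejected by the OCaml type checker. *)

let result = m { g = List.rev }
```

The extractor's use of Obj.magic avoids having to introduce such wrapper types, at the cost of bypassing OCaml's type checking at those points.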


The following program, which is similar to some code in the Builtins0.v
CompCert module, uses dependent types:


Inductive data := DNat : nat -> data | DBool : bool -> data.
Definition get_type (d : data) : Type :=
  match d with DNat _ => nat | DBool _ => bool end.
Definition extract (d : data) : get_type d :=
  match d with DNat n => n | DBool b => b end.
Require Extraction. Extraction extract.


Its extraction uses Obj.magic:⁸


let extract = function DNat n -> Obj.magic n 
| DBool b -> Obj.magic b 
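
As footnote 8 suggests, this pattern can be expressed with a GADT when the code is written by hand. The OCaml sketch below is such a hand-written variant, not something the extractor produces, and it uses `int` in place of Coq's `nat`. The type index plays the role of `get_type`, so no cast is needed.

```ocaml
(* Each constructor fixes the type index, so the result type of [extract]
   can depend on the constructor without any unsafe cast. *)
type _ data =
  | DNat : int -> int data
  | DBool : bool -> bool data

(* The locally abstract type [a] is refined in each branch of the match. *)
let extract : type a. a data -> a = function
  | DNat n -> n
  | DBool b -> b
```

Here `extract (DNat 3)` has type `int` and `extract (DBool true)` has type `bool`, both checked statically by the OCaml compiler.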


Thus, incorrect behavior in the Coq extractor could, in theory at least, pro- 
duce OCaml code that would not be type-safe, in addition to producing code not 
matching the Coq behavior. Is this serious cause for concern? On the one hand, 
the extraction process is quite syntactic and generic. It seems unlikely that it 
could produce valid OCaml code that would compile, pass tests, yet occasionally 
would have subtly incorrect behavior.⁹ On the other hand, CompCert is perhaps
the only major project using the extractor, which is thus not thoroughly tested.
We do not know of any extractor bug that could result in CompCert
miscompiling. Another related potential source of bugs is the linking of OCaml code
extracted from Coq with “external” OCaml code. This is discussed in Section 3.2.

Sozeau et al. [51] study an approach to reduce the TCB of Coq by providing
a formally verified (in Coq) implementation of a significant subset of its
kernel and paving the road for a formally verified extraction. However, the target
language of the extraction (OCaml?) would still be in the TCB. An alternative
solution would be direct generation of assembly code from Gallina, as done
by Œuf [42]; however parts of CompCert are currently written in OCaml and 
would have to be rewritten into Gallina. Œuf extracts Gallina to Cminor, one of 


7 Some System F-like polymorphism was added to OCaml: record types with
polymorphic fields. This is not used by Coq’s extractor as of Coq 8.13.2.

8 Variants of this example correspond to generalized algebraic data types (GADTs),
another recent addition to OCaml’s type system not yet exploited by the extractor.

9 Coq’s bug tracker lists extractor bugs that, to the best of our knowledge, result in
programs that are rejected by OCaml compilers.

the early intermediate languages of CompCert, then produces code using
CompCert.¹⁰ CertiCoq¹¹ [45,44] also extracts to Clight, which may be compiled with
any C compiler.


3 Use of Axioms in Coq 


Coq, like other proof assistants, checks that theorems are properly deduced from
a (possibly empty) set of axioms. Axioms are also introduced as a mechanism 
to link Gallina programs to external OCaml code through extraction. Improper 
use of axioms may lead to two forms of inconsistency: logical inconsistency and 
inconsistency between the Coq proof and the OCaml external code. 


3.1 Logical Inconsistency 


Coq is based on type theory, with logical statements seen through the Curry- 
Howard correspondence: a proof of a logical statement is the same thing as a 
program having a certain type. In other words, a theorem is proved if and only 
if there is a A-term inhabiting the type corresponding to the statement of the 
theorem. An axiom is thus just the statement that a certain constant, given 
without definition, inhabits a certain type. 

The danger of using axioms is that they may introduce inconsistency, that is, 
being able to prove a contradiction; from which, through ex falso quodlibet, any
arbitrary statement is provable. Furthermore, it is possible that several axioms 
are innocuous individually, but create inconsistency when added together. 

There are several common use cases for axioms in Coq. One is being able 
to use modes of reasoning that are not supported by Coq’s default logic:
CompCert¹² adds the excluded middle (∀P, P ∨ ¬P) for classical logic, functional
extensionality (f = g if and only if ∀x, f(x) = g(x)), and proof irrelevance


10 Other systems meant to generate code from definitions in a proof assistant, generate 
code directly rather than reuse an existing compiler. This approach is promoted [31]
with the argument that such a process is safer than textual extraction to, say, OCaml. 
This is not so clear to us. On the one hand, extracting (without proof of correctness) 
Gallina to a subset of OCaml, printing the result, then running the OCaml compiler, 
surely adds a lot to the TCB. On the other hand, it is typically difficult to get right in 
a compiler the modeling of the assembly instructions, the ABI, the foreign function 
interface, as discussed in Section 5. Bugs at that level are caught by extensive testing. 
Surely, the OCaml code generator, the many libraries using OCaml’s foreign function 
interface, are more thoroughly tested by usage than a code generator used to extract 
a few specific projects developed in a proof assistant. 

11 https://github.com/CertiCoq/certicoq

12 CompCert module Axioms.v imports module FunctionalExtensionality from the 
Coq standard library, which both states functional extensionality and states proof 
irrelevance as axioms. Some CompCert modules import the standard Classical 
module, which states excluded-middle as an axiom. Since proof irrelevance is a 
consequence of excluded-middle, it should be possible to just import Classical 
in Axioms.v and deduce proof irrelevance from it.

(one assumes that the precise statement of a proof as a λ-term is irrelevant).
Meta-theoretical arguments have shown that these three axioms do not introduce
inconsistencies.¹³

Another use case for axioms is to introduce names for types, constants and 
functions defined in OCaml, with a relationship between these and those of the 
OCaml types and functions to be specified for Coq’s extraction facility. For in- 
stance, to call an OCaml function f: nat -> bool list one would use 


Axiom f: nat -> list bool. Extract Inlined Constant f => "f".


This is used extensively in CompCert, to call algorithms implemented in OCaml 
for efficiency, using machine integers and imperative data structures; see 3.3.
Similarly, one can refer to an OCaml constant as follows:¹⁴


Axiom size : nat. Extract Inlined Constant size => "size". 
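
On the OCaml side, these two axioms are realized by ordinary definitions with matching names. The sketch below shows what such hand-written definitions could look like; the bodies are invented for illustration. It assumes an extraction setup in which Coq's `bool` and `list` are mapped to the native OCaml types while `nat` keeps its constructor form.

```ocaml
(* Constructor form of Coq's nat, as used when no integer mapping is set up. *)
type nat = O | S of nat

(* Hand-written OCaml definitions with the names given in the Extract commands.
   Nothing on the Coq side checks what these bodies actually compute. *)
let rec f (n : nat) : bool list =
  match n with
  | O -> []
  | S k -> true :: f k

let size : nat = S (S (S O))
```

The Coq development only knows the declared types `nat -> list bool` and `nat`. Any property of `f` or `size` beyond those types must either be stated as a further axiom or checked at run time.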


Incorrect use of axioms to be realized through extraction can lead to logical 
inconsistency. Consider, for instance, this variant, where the size external
definition is supposed to be a negative natural number (maybe because we mistakenly
typed n < 0 instead of n < 10); one can easily derive False from it: 


Axiom size : { n : nat | n < 0}. 


One approach for avoiding such logical inconsistencies is to avoid axioms that 
specify types carrying logical specifications, that is, proofs (e.g., here n < 0); 
this is anyway a good idea, because such types may also result in mismatches 
(see 3.2). No OCaml function in CompCert accessed from Coq has a Coq type
carrying a logical specification, with one exception, in CompCert-KVX:


Axiom profiling_id: Type. 
Axiom profiling_id_eq: forall (x y : profiling_id), {x=y} + {x<>y}.


These axioms state that there exists a type called profiling_id fitted with a 
decidable equality, both of which are defined in OCaml. This decidable equality 
is a technical dependency of the decidable equality over instructions. 

In order to avoid logical inconsistencies due to axioms referring to external 
definitions, one can prove that the type in which the Axiom command states that 


13 There is a model of Coq’s core calculus in Zermelo-Fraenkel set theory with the 
Axiom of Choice and inaccessible cardinals [32,53]. Such a model is compatible 
with these axioms. Previously, in times when Coq’s Set sort was impredicative (it 
can still be selected to be so by a command-line option), it became apparent that 
this was incompatible with excluded-middle and forms of choice suitable for finding 
representatives of quotient sets [15,16]. This should be a cause of caution, though 
we think it unlikely to exploit such paradoxes by accident. 

14 This may allow compiling a Coq development once (Coq compilation may be
expensive, as certain proofs take a lot of time) and then adjusting some constants when
compiling and linking the extracted OCaml code, maybe for different use cases. This
is not used in CompCert, which instead, for flexibility, allows certain features to be
selected at run-time through command-line options.


210 David Monniaux, Sylvain Boulmé 


there exists a certain term is actually inhabited; this establishes that the axiom 
does not introduce inconsistency. For instance, one can specify an OCaml con- 
stant n < 10, to be resolved at compile-time, and exclude logical inconsistency 
by showing that such a constant actually exists: 


Axiom size : { n : nat | n < 10 }. 
Lemma size_can_exist: { n : nat | n < 10 }. 
Proof. exists 0; lia. Qed. 


This approach is occasionally used in Coq and CompCert for axiomatizing alge- 
braic structures. For instance, Coq specifies constructive reals axiomatically, then 
provides an implementation that satisfies that specification; CompCert-KVX’s im- 
pure monad (discussed in Section 3.3) is specified axiomatically, but the authors 
provide several implementations satisfying that specification [11]. Similarly, the 
authors could have provided an implementation of profiling_id (e.g., natural 
numbers) and profiling_id_eq to show that these two axioms did not introduce 
logical inconsistencies. 


3.2 Mismatches between Coq and OCaml 


Though safe, the extractor can be used inappropriately. We have just seen that 
adding an axiom standing for an OCaml function can, if that axiom is not realiz- 
able in Coq, lead to logical inconsistency. Even if the axiom is logically consistent, 
extraction to arbitrary OCaml code can lead to undesirable runtime behavior. 
An obvious case is when, in addition to an axiom specifying a constant re- 
ferring, at extraction time, to an OCaml function, one adds an axiom specifying 
the behavior of that function, and that behavior does not match the specifica- 
tion. For instance, one can specify f to be a function returning a natural number 
greater than or equal to 3, then, through extraction, define it to return 0: 


Axiom f : nat -> nat. Axiom f_ge_3 : forall x, (f x) >= 3.
Definition g x := Nat.leb 1 (f x).
Extract Constant f => "fun x -> O".


Unsurprisingly, it is possible to prove in Coq that g always returns true, and
yet to run the OCaml code and see that it returns false. It is similarly possible
to write Coq code with impossible cases that the extractor will extract to
assert false, and the extracted code will actually reach this statement and die
with an uncaught exception, which is after all a better outcome than producing
output that contradicts theorems that have been proved. In the following code,
False_rec _ _ eliminates on False, which is obtained from contradiction with
f x >= 3, and is extracted to an always failing assertion.


Program Definition h x := match f x with
| 0 => False_rec _ _ | S 0 => False_rec _ _
| S (S 0) => False_rec _ _ | S (S (S x)) => x
end.


Axiomatizing the behavior of externally defined functions circumvents the
idea of verified software; nowhere in the CompCert source code is there such
an axiomatization. An equivalent but perhaps more discreet way of axiomatizing
the behavior of an OCaml function is through dependent types. Consider, again,


Axiom size : { n : nat | n < 10 }. 


It is possible, through extraction mechanisms, to bind size to the OCaml constant 11;
this is because the type of size is extracted to exactly the same OCaml
type as nat: the proof component is discarded. It is then possible to similarly
lead the OCaml code extracted from Coq to cases that should be impossible. 
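
The effect can be sketched directly in OCaml. The following is a minimal, hand-written stand-in for extracted code, not actual extractor output: the type nat mirrors the default extraction of Coq's nat, while nat_of_int, int_of_nat, table and lookup are hypothetical names introduced only for illustration.

```ocaml
(* Assumption: unary naturals, as in the default extraction of Coq's nat. *)
type nat = O | S of nat

(* Hypothetical helpers converting between OCaml ints and unary naturals. *)
let rec nat_of_int (n : int) : nat =
  if n <= 0 then O else S (nat_of_int (n - 1))

let rec int_of_nat : nat -> int = function
  | O -> 0
  | S p -> 1 + int_of_nat p

(* The Coq type { n : nat | n < 10 } is erased to plain nat, so the OCaml
   typechecker accepts a value that violates the bound. *)
let size : nat = nat_of_int 11

(* A ten-entry table: an access that the Coq side would consider safe
   because of the bound n < 10. *)
let table = Array.init 10 (fun i -> i * i)

(* With size bound to 11, the access fails at runtime. *)
let lookup () = table.(int_of_nat size)
```

Under default compiler settings (array bounds checking enabled), calling lookup () raises Invalid_argument: a case that the Coq proofs rule out is reached at runtime.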

The only case of such axiomatization, in CompCert-KVX, is the previously
introduced profiling_id_eq axiom, which is bound to the Digest.equal function
from OCaml’s standard library, and defined to be string equality. We can surely
assume OCaml’s string equality test to be correct; otherwise many things
in Coq and other tools used to build CompCert are likely incorrect as well.

It is also possible to instruct the extractor to extract certain Coq types to 
specific OCaml types, instead of emitting a normal declaration for them. The 
main use for this is to extract Coq types such as list or bool to the correspond- 
ing types in the OCaml standard library, as opposed to introducing a second 
list type, a second Boolean type; this is in fact so common that the standard 
Coq.extraction.ExtrOcamlBasic specifies a number of such specific extrac- 
tions, and so does CompCert. This is not controversial. The extractor also allows 
fully specifying how a Coq type maps to OCaml, including the constructor and 
“match” destructor; the only use of this feature in CompCert is in CompCert-KVX 
for implementing some forms of hash-consing (Sec. 3.4). 

An in-depth discussion of further aspects of Coq/OCaml interfacing may be 
found in Boulmé’s habilitation thesis [11]. 


3.3 Interfacing External Code as Pure Functions 


Coq is based on a pure functional programming language; as in mathematics, 
if the same function gets called twice with the same arguments, it returns the 
same value. OCaml is an impure language, and the same function called with 
the same arguments may return different values over time, whether it depends 
on mutable state internal to the program or on external calls (user input, etc.). 
By binding Coq axioms to impure functions, we can, again, lead OCaml code 
extracted from Coq to places it should not go. 

For instance, the z Boolean expression extracted from this Coq program is 
false though it is proved to be true: it calls the same function twice with the 
same argument and compares the results¹⁵; but since that function is impure
and returns the value of a counter incremented at each call, two successive calls 
always return unequal values. 


15 This result is computed by the “Nat.eqb” Boolean equality over naturals (in contrast,
the Coq propositional equality, written “=”, is only logical).


Axiom f: unit -> nat.
Extract Constant f =>
 "let count = ref O in fun () -> count := S (!count); !count".
Definition z: bool := Nat.eqb (f tt) (f tt).
Lemma ztrue: z = true.
 unfold z; rewrite Nat.eqb_refl; congruence.
Qed.


CompCert calls a number of OCaml auxiliary functions as pure functions, 
most notably the register allocator. These functions are “oracles”, in the sense 
that they are not trusted to return correct results; their results are used to guide 
compilation choices, and may be submitted to checks. Both CompCert-SSA and 
CompCert-KVX add further oracles. 

Could impure program constructs, in particular mutable state, in these ora- 
cles, lead to runtime inconsistencies? The code of some of these oracles is simple 
enough that it can be checked to behave overall functionally: mutable state, if 
any, is created locally within the function and does not persist across function 
calls. In the register allocator, there are a few global mutable variables (e.g., 
max_age, max_num_eqs), and perhaps it is possible to obtain different register
allocations for the same function by running the allocator several times. It seems
unlikely that some CompCert code would intentionally call a (possibly computationally
expensive) oracle twice with the same inputs, then reach an incorrect answer
if the two returned values differ. Yet, it is not obvious that this cannot happen.

To avoid such uncertainties, the CompCert-KVX authors encapsulated some 
of their oracles, in particular oracles used within simulation checkers by symbolic 
execution [48,47,49], inside the may-return monad of [11]. The monad models 
nondeterministic behavior: the same function may return different values when 
called with the same argument without leading into inconsistent cases. Beyond 
soundness, a major feature of this approach is to provide “theorems for free” 
about polymorphic higher-order foreign OCaml code. In other words, this ap- 
proach ensures for free (i.e., by the OCaml typechecker) that some invariants 
proved on the Coq side are preserved by untrusted OCaml code [11]. While 
this technique has been intensively applied within the Verified Polyhedron Li- 
brary [12], it is only marginally used within the current CompCert-KVX, only for 
a linear-time inclusion test between lists. 

This approach however has two drawbacks. Firstly, despite the introduction 
of tactics based on weakest liberal precondition calculus, the proof effort is heav- 
ier than for code written with pure functions without a monadic style. Secondly, 
all the code calling impure functions modeled within the may-return monad also 
becomes impure code modeled within that monad, meaning that a significant 
part of the rest of CompCert (at least the code calling the sequence of optimization
phases and their proofs) would have to be rewritten using that monad.¹⁶


16 Much of CompCert is already written in an error monad, with respect to which
the may-return monad is a straightforward generalization. It thus seems feasible to
rewrite CompCert with the may-return monad instead of the existing error monad. In


CompCert’s Coq code accesses mutable variables storing command-line options
through helper functions. This supposes that these variables stay constant
once the command line has been parsed, which is the case.

In Coq, all functions must be shown to be terminating (because nontermi- 
nating terms can be used to establish inconsistencies). Arguments for the ter- 
mination of a function are sometimes more intricate and painful to write in 
Coq than those for its partial correctness, and termination is not really useful 
in practice: from the point of view of the end-user there is no difference between
a terminating function that takes a prohibitively long time to terminate,
and a nonterminating function. For this reason, some procedures in CompCert 
and forks that search for a solution to a problem (e.g., a fixpoint of an operator) 
are defined by induction on a positive number, and return a default or error 
value if the base case of the induction is reached before the solution is found. 
Iteration.PrimIter, used for instance in the implementation of Kildall’s fix- 
point solving algorithm for dataflow analysis, thus uses a large positive constant 
num_iterations = 10^12. Such numbers are often informally known as fuel.

CompCert-SSA takes an even more radical view: a natural number fuel is 
left undefined, as an axiom, inside the Coq source code, and is extracted to 
OCaml code let rec fuel = S fuel, meaning that fuel is circularly defined as 
its own successor, and in practice acts as an infinite stream of successors. Why 
that choice? num_iterations is a huge constant belonging to the positive type, 
which models positive integers in binary notation; there is a custom induction 
scheme for this type that implements the usual well-founded ordering on posi- 
tive integers. In contrast, fuel is a natural number in unary notation, on which 
inductive functions may be defined by structural induction, which is a bit easier 
than with a custom induction scheme; but it is impossible to define a huge con- 
stant in unary notation. The num_iterations scheme is cleaner, but we have not 
identified any actual problem with the fuel scheme. The OCaml code extracted 
from Coq has no way to distinguish fuel from a large constant. 

The fuel trick however breaks if pointer equality is exposed on the natural 
number type [11]. The following program, defined using a “may return” monad, 
where phys_eq_nat is pointer equality on natural numbers, can be proved not to 
return true; yet, it does return true at runtime. 


Definition fuel_eq_pred := 
match fuel with 
| 0 => Impure.ret false
| S x => phys_eq_nat fuel x
end. 
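
On the OCaml side, the situation can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not code from CompCert-SSA: the type nat mirrors the default extraction of Coq's nat, iterate and step are hypothetical, and OCaml's (==) stands in for phys_eq_nat.

```ocaml
(* Assumption: unary naturals, as in the default extraction of Coq's nat. *)
type nat = O | S of nat

let rec nat_of_int (n : int) : nat =
  if n <= 0 then O else S (nat_of_int (n - 1))

(* Circular fuel: a single heap block whose field points back to itself. *)
let rec fuel = S fuel

(* Hypothetical fuel-bounded fixpoint search, by structural recursion on the
   fuel argument; it returns None when the fuel runs out. *)
let rec iterate (n : nat) (step : int -> int) (x : int) : int option =
  match n with
  | O -> None
  | S rest ->
    let x' = step x in
    if x' = x then Some x else iterate rest step x'

(* A toy operator whose fixpoint is 10. *)
let step n = min 10 (n + 1)

let out_of_fuel = iterate (nat_of_int 3) step 0  (* finite fuel, too small *)
let converged = iterate fuel step 0              (* circular fuel *)

(* Pointer equality observes the cycle: fuel is its own predecessor. *)
let fuel_eq_pred = match fuel with O -> false | S x -> fuel == x
```

As long as the extracted code only pattern-matches on fuel, it behaves like an inexhaustible constant; note that structural equality (=) must never be applied to it, since the traversal would not terminate.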


practice, this represents a lot of reengineering work. For example, currently, the
may-return monad provides a tactic for backward reasoning, based on a weakest-precondition
calculus. In contrast, CompCert provides a tactic for forward reasoning on the error
monad. Thus, defining a tactic on the may-return monad that behaves like the one of
the error monad would help in reducing the amount of changes in CompCert proofs.


3.4 Pointer Equality and Hash-Consing


The normal way in Coq to decide the equality of two tree-like data structures 
is to traverse them recursively. The worst case of this approach is reached when
the structures are equal, in which case they will be traversed completely. Un- 
fortunately this case is frequent in many applications for verified compilation, 
verified static analysis, etc.: when the data structures represent abstract sets of 
states (in abstract interpretation), equality signals the equality of these abstract 
sets, which indicates that a fixed point is reached; equality between symbolic 
expressions is used for translation validation through symbolic execution [48]. 
Furthermore, there are many algorithms that traverse pairs of tree-like structures
for which there are shortcuts if two substructures are equal: for instance, if
the algorithm computes the union of two sets and these sets are equal, then
the union is that same set [41, §5]; being able to exploit such cases has long been
known to be important for the speed of static analyzers [8, §6.1.2].

If we were programming in OCaml, we could simply use pointer equality (==) 
for a quick check that two objects are equal: if they are at the same memory 
location, then they are necessarily structurally equal (the converse is not true in 
general). In Coq, a naive formalization of this approach could be: 


Parameter A: Type. 
Axiom phys_eq: A -> A -> bool.
Axiom phys_eq_implies_eq: forall x y, phys_eq x y = true -> x = y.


This approach is however unsound.¹⁷ We prove that x_eq_x and x_eq_y are
equal; yet in the extracted code, the former evaluates to true, the latter to false.


Definition x := S O. (* 1 *)           Definition y := S O. (* 1 *)
Definition x_eq_x := phys_eq x x.      Definition x_eq_y := phys_eq x y.


Extract Inlined Constant phys_eq => "(==)".
Recursive Extraction x_eq_x x_eq_y.
Lemma same : x_eq_x = x_eq_y. Proof. reflexivity. Qed.


To summarize, OCaml pointer equality can distinguish two structurally equal 
objects, whereas this is provably impossible for Coq functions: for Coq, x and 
y are the same, so they are interchangeable as arguments to phys_eq. This is 
the functionality issue of Section 3.3 in another guise: the same OCaml function 
must be allowed to return different values when called with the same argument. 
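
The distinction can be observed in plain OCaml. The sketch below is illustrative only: nat mirrors the default extraction of Coq's nat, and nat_of_int is a hypothetical helper used to force two separate runtime allocations of the same value (compilers may share identical constants, in which case pointer equality could hold).

```ocaml
(* Assumption: unary naturals, as in the default extraction of Coq's nat. *)
type nat = O | S of nat

(* Hypothetical helper: builds a fresh heap block for each successor. *)
let rec nat_of_int (n : int) : nat =
  if n <= 0 then O else S (nat_of_int (n - 1))

let x = nat_of_int 1
let y = nat_of_int 1   (* same structure as x, allocated separately *)

(* Structural equality cannot tell x and y apart. *)
let structurally_equal = (x = y)

(* Pointer equality can: an object is always at its own address, whereas two
   separately allocated blocks are expected to be at different addresses
   with the standard OCaml compilers. *)
let x_eq_x = (x == x)
let x_eq_y = (x == y)
```

This is why a Coq axiom typing pointer equality as a pure function A -> A -> bool cannot be faithful: x and y are the same Coq term, yet the OCaml function answers differently on them.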

The solution used in CompCert-KVX for checking that symbolic values are 
equal was thus to model pointer equality as a nondeterministic function in a 
“may return” monad. In this model [11], pointer equality nondeterministically 


17 We saw in the preceding section another possible cause of unsoundness: if circular
data structures are defined in OCaml inside inductive types, pointer equality can
be used to establish that a term is equal to one of its strict subterms, which is
normally impossible, and thus leads to an absurd case at execution time. To avoid this,
either completely disallow linking to circular terms constructed in OCaml, or restrict
pointer equality tests to types where such circular terms are not constructed.


discovers some structural equalities.¹⁸ This solution has one drawback: the whole
of the symbolic execution checker is defined within this monad, and the authors
unsafely exit from that monad to avoid running much of CompCert through it. It
is uncontroversial that pointer equality implies equality of the pointed objects.
The only cause for unsoundness in such an approach could be the unsafe exit.
Yet, again, why would CompCert-KVX call the symbolic execution engine twice
with the same arguments to reach an absurd case for different outcomes?

Opportunistic detection of identical substructures through pointer equality 
was implemented for instance in Astrée [8]. This approach takes advantage of the 
fact that many algorithms operating on functional data structures simply copy 
pointers to parts of structures that are left intact: the opportunistic approach
detects that some parts of structures have been left untouched, skipping costly
traversals. It however does not work if a structure is reconstructed from scratch,
for instance as the result of a symbolic execution algorithm: if two symbolic
executions yield the same result, these results are defined by isomorphic data 
structures but the pointers are different. What is needed then is hash-consing: 
when constructing a new node, search a hash-table containing all currently ex- 
isting nodes for an identical node and return it if it exists, otherwise create a new 
node and insert it into the table. Hash-consing is widely used in symbolic com- 
putation, SMT-solvers etc.; there exist libraries making it easy in OCaml [19], 
and the OCaml standard library contains a weak hash-table module, one of the 
main uses of which is being a basic block for hash-consing. 
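
A minimal sketch of the hash-consing step just described is given below. It is not the CompCert-KVX implementation: the term type, the key type and the functions hashcons, leaf and pair are hypothetical, and for brevity it uses the ordinary Hashtbl module (which keeps every node alive) rather than a weak hash-table.

```ocaml
(* Hypothetical term type; each node carries a unique identifier. *)
type term = { node : node; id : int }
and node = Leaf of string | Pair of term * term

(* Shallow key: children are already hash-consed, so their identifiers
   suffice to describe a node, and hashing takes constant time. *)
type key = KLeaf of string | KPair of int * int

let table : (key, term) Hashtbl.t = Hashtbl.create 97
let next_id = ref 0

(* Search the table for an identical node; return it if it exists,
   otherwise create a new node and insert it into the table. *)
let hashcons (n : node) : term =
  let k = match n with
    | Leaf s -> KLeaf s
    | Pair (a, b) -> KPair (a.id, b.id)
  in
  match Hashtbl.find_opt table k with
  | Some t -> t
  | None ->
    let t = { node = n; id = !next_id } in
    incr next_id;
    Hashtbl.add table k t;
    t

let leaf s = hashcons (Leaf s)
let pair a b = hashcons (Pair (a, b))
```

With this discipline, two terms built independently but with the same structure, for instance pair (leaf "a") (leaf "b") evaluated twice, are one and the same object in memory, so that (==) can replace a recursive comparison.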

The difficulty is that, though overall the construction of new objects behaves 
functionally (it returns objects that are structurally identical to what a direct 
application of a constructor would produce), it internally keeps a global state 
inside the hash-table. Several solutions have been proposed to that problem [14]; 
one is to keep that global state explicitly inside a state monad, which amounts 
to threading the current state of the hash table through all computations. In 
the original version from [14], this implied implementing the hash-table by em- 
ulating an array using functional data structures, which was very inefficient. 
Coq 8.13 introduced primitive 63-bit integers and arrays (with a functional in- 
terface), optimized for cases where the old version of an updated array is never 
used anymore [17, §2.3], which, through special extraction directives, may be 
extracted to OCaml native integers and arrays. That solution was not adopted 
for CompCert-KVX, only because Coq 8.13 had not yet been released when the 
project started. Instead, CompCert-KVX has experimented with two alternative 
approaches for hash-consing. 

The first approach used in CompCert-KVX introduces an untrusted OCaml 
function (modeled as a nondeterministic function within the may-return monad) 
that constructs terms through the hash-consing mechanism (searching in the 
hash-table etc.); these terms are then quickly checked for equivalence with the desired
terms, using a provably correct checker. For instance, if a term c(a1,...,an)
is to be constructed, and the function returns a term t, then the root constructor


18 In this model, a given Coq term is not necessarily equal to “itself” for pointer equality,
because, in a Coq proposition, “itself” implicitly means a structural copy of “itself”.


of t is checked to be c, then the arguments to that constructor are checked to
be equal to a1,...,an by pointer equality.¹⁹ This solution does not add anything
to the trusted computing base, apart from pointer equality. A may-return
monad is used because the OCaml code is untrusted, and in particular is not 
trusted to behave functionally. The drawback is that, though the OCaml code 
will always make sure that there are never two identical terms in memory at 
different pointer addresses, this is not reflected from the point of view of proofs: 
in the Coq model (discussed above) of pointer equality within the may-return 
monad, pointer equality implies structural equality, but structural equality does 
not imply pointer equality. However, only the former is needed for a symbolic 
execution engine that checks that two executions are indeed equivalent by struc- 
tural equality of terms, as in the scheduler in CompCert-KVX [48]. 
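
The shape of such a check can be sketched in OCaml as follows. This is only an illustration of the idea, written directly in OCaml rather than in Coq within the may-return monad; the term type and the names check_pair and build_pair are hypothetical.

```ocaml
(* Hypothetical term type with a single binary constructor. *)
type term = Leaf of string | Pair of term * term

(* Constant-time check that t is Pair (a, b): the root constructor is
   inspected, and the arguments are compared by pointer equality only. *)
let check_pair (a : term) (b : term) (t : term) : bool =
  match t with
  | Pair (a', b') -> a' == a && b' == b
  | Leaf _ -> false

(* The untrusted oracle (e.g., a hash-consing table lookup) proposes a term;
   the result is accepted only if it passes the check. *)
let build_pair (oracle : term -> term -> term) (a : term) (b : term)
  : term option =
  let t = oracle a b in
  if check_pair a b t then Some t else None
```

A faulty oracle, for instance one returning Pair (a, a) when asked for Pair (a, b), is rejected; a correct one is accepted without any recursive traversal of a and b.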


Having to thread a whole computation through a monad, further adding to 
proof complexity, for actions that are expected to behave functionally overall, is 
onerous. One solution is to add hash-consing natively inside the runtime system;
for instance, the GimML language,²⁰ from the ML family [23,22,21], automatically
performs hash-consing on datatypes on which it is safe to do so, which
is for instance used to implement efficient finite sets and maps. This can be 
emulated by a “smart constructor” approach [14], replacing, through the ex- 
traction mechanism, calls to the term constructor, term pattern matching, and 
term equality by calls to appropriate OCaml procedures: the constructor per- 
forms hash-consing, the pattern matcher performs pattern matching ignoring 
the internal-use “unique identifier” field used for hash-consing, and term equality
is defined to be pointer equality; appropriate OCaml encapsulation prevents
manipulation of these terms except through these three functions, and in particular
prevents them from being constructed by other methods than the smart
constructor. Assuming that this OCaml code is correct, this is indeed sound, due
to the global invariant that there never exist two distinct yet structurally identical
terms of the hash-consed type currently reachable inside memory. Because
terms can only be built using the smart constructor, and because hash-consing ensures
that pointer equality is equivalent to structural equality, pointer equality
can indeed be treated as a deterministic function, without need for a monad.
This approach has the benefit of an easy-to-understand interface and simple
proofs; this was the second approach experimented with in CompCert-KVX and
was used for the HashedSet module [41].


This second approach adds significantly more OCaml code to the trusted com- 
puting base than just assuming that pointer equality implies structural equality. 
Yet, this OCaml code is small, with few execution paths, and can be easily tested 
and audited. It assumes the correctness of OCaml’s weak hash-tables; however, 
Coq’s kernel includes a module (Hashset) that is also implemented using these 
weak hash-tables, so one already assumes that correctness when using Coq. 


19 A unique identifier is added as an extra field to each object, for reasons including 
efficient hashing. Structural equality is thus modulo differences in unique identifiers. 
20 https://projects.lsv.fr/agreg/?page_id=258 Formerly HimML.


CompCert parses C and assigns a formal semantics to it. As such, it depends
on a formal model of the C syntax and a formal semantics for it, supposed to 
reflect the English specification given in the international standard [4]. CompCert 
supports an extensive subset of C99 [3] (notable missing items are variable-length 
arrays and some forms of unstructured branching, à la Duff’s device) and some 
C11 features (note that in C11, support for variable-length arrays is optional).²¹

The formal semantics of C supported by CompCert is called “CompCert C”. 
Converting the source program, given in a text file, to the CompCert C AST 
(abstract syntax tree) on which the formal semantics is defined, relies on many 
nontrivial transformations: preprocessing, lexing (lexical analysis), parsing (AST 
building) and typechecking. Most of them are unverified, but trusted. There are 
two important exceptions: significant parts of the parser and the typechecker of 
CompCert C are formally verified. The formally verified parser is implemented 
using the Menhir parser generator, and there is a formal verification of its cor- 
rectness with respect to an attribute LR(1) grammar [25]. It relies on an un- 
verified “pre-parser” to distinguish identifier types introduced by typedef from 
other identifiers (a well-known issue of context-free parsing of C programs). It 
produces an AST which is then simplified and annotated with types, by an- 
other unverified pass, called “elaboration”. Finally, the resulting CompCert C 
program is typechecked, by the formally verified typechecker. This is where the 
fully verified frontend of CompCert really starts. 

Obviously, a divergence between the semantics of C as understood by Com- 
pCert and that semantics as commonly understood by programmers to be com- 
piled may lead to problems. Validating such semantics is an important issue [9]. 
The standard has evolved over time to take into account common programming
practices or to resolve some contradictions.²² CompCert semantics has also
evolved to get closer to the standard; see [30]. In recent years, a few minor
divergences have been spotted. For instance, there was a minor misimplementation
of scoping rules (commit 99918e4) that led the following program to allocate s 
of size 3 (sizeof(t) being interpreted with t the global variable, whereas the 
standard mandates it should refer to the t variable declared before it on the 
same line) instead of 4: 


char t[]={1,2,3}; 
int main() { char t[]={1,2,3,4}, s[sizeof(t)]; 
return sizeof(s); } 


Another example: CompCert and other compilers accepted some extension to the 
syntax of C99 (anonymous fields in structures and unions) but assigned slightly 
different meanings to it (different behavior during initialization, issue 411). 


21 The CH2O project (https://robbertkrebbers.nl/research/ch2o/) aims at formalizing
the ISO C11 standard in Coq. This development is unrelated to the formalization
inside CompCert.

22 See an example on http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm.


The C standard leaves many behaviors undefined: anything can happen if
the program exercises such a behavior (the compiler may refuse the program, the 
program may compile and run but halt abruptly when encountering the message, 
or may continue running with arbitrary behavior). Some undefined behaviors, 
such as array access out of bounds, are exploited in malicious attacks. The 
C standard also leaves many behaviors unspecified, meaning the compiler may 
choose to implement them arbitrarily within a certain range of possibilities— 
e.g., the order of evaluation of parts of certain expressions with respect to side 
effects.²³ Actually, distinguishing between unspecified and undefined behavior in
the evaluation order is rather complex: see [29] for a formal semantics. Furthermore,
many compilers implement extensions to the standard. Some deviate from
the standard’s mandated behavior in some respects.²⁴

Many programs, be they applications, libraries or system libraries, rely on
the behavior of the default compiler on their platform (e.g., gcc on Linux, clang
on MacOS, Microsoft Visual Studio for Windows).²⁵ If compilation just fails,
then issues are relatively easy to spot (though maintaining support for multiple
compilers, often through conditional compilation and preprocessor definitions, is
error-prone); subtler problems may be encountered when software compiles but has
different behavior with different compilers.²⁶ It may be difficult to trace differences
in outcomes to a bug (including reliance on undefined behavior) or to
a difference in valid implementations of unspecified behavior.

The only semantic issue that we know of regarding CompCert’s forthcoming 
version 3.10 is with respect to bitfields. A write to a bitfield is implemented using
bitshift and bitwise Boolean operations, and these operations produce the
“undefined” value if one of their operands is “undefined”. Writing to a bitfield
originally stored in an uninitialized machine word or long word, which is the
case for local variables, thus results in an “undefined” value, whereas the bits
written to are actually defined. Reading from that bitfield will then produce the
“undefined” value, as can be witnessed by running the program in CompCert’s
reference interpreter, which stops, complaining of undefined behavior. Fixing this
issue would entail using a bit-wise memory model (issue 418).²⁷ It may be pos-


23 This should not be confused with syntactic associativity, which is fully defined by 
the standard. 

24 For instance, Intel’s compiler, at least at some point, deliberately deviated from stan- 
dard floating-point behavior to produce more efficient code. An option was needed to 
get standard compliance. In contrast, gcc would by default comply with the standard, 
and enable optimizations similar to Intel’s when passed options such as -ffast-math 
or the aptly-named -funsafe-math-optimizations [40]. 

25 On Linux, compiling software with gcc -std=c99, which disables some GNU-specific 
extensions, often fails. On the KVX, CompCert-KVX includes a kludge for defining 
a __int128 type suitable enough for processing system header files. 

26 As an example, C compilers are allowed to replace a*b+c by a fused multiply-add 
fma(a, b, c), which may produce slightly different results. Such replacements may 
be disabled by a command-line option or a pragma. 

27 Questions of “undefined” and “poison” values are notoriously difficult to get right
in semantics; see [33] for a discussion of intricate bugs in LLVM.


sible to write and prove correct a phase that would replace this “undefined”
value by an arbitrary value and thus result in miscompilation. We do not know, 
however, of any phase that would produce this in CompCert or variants. 
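
The propagation of the “undefined” value through a bitfield write can be illustrated by a toy model. The sketch below is a deliberately simplified miniature, not CompCert's actual value domain or memory model: the type value and the functions lift2, field_mask, bitfield_write and bitfield_read are assumptions made for the example, and machine words are modeled by OCaml ints.

```ocaml
(* Toy value domain: either a defined integer or the "undefined" value. *)
type value = Vundef | Vint of int

(* Whole-word semantics: any undefined operand makes the result undefined. *)
let lift2 (f : int -> int -> int) (a : value) (b : value) : value =
  match a, b with
  | Vint x, Vint y -> Vint (f x y)
  | _, _ -> Vundef

let vand = lift2 ( land )
let vor = lift2 ( lor )
let vshl = lift2 ( lsl )
let vshr = lift2 ( lsr )

(* Mask selecting width bits starting at bit position pos. *)
let field_mask ~pos ~width = ((1 lsl width) - 1) lsl pos

(* Write v into the field: clear the field in the old word, then OR in the
   shifted new value. The old word is an operand of the bitwise AND. *)
let bitfield_write ~pos ~width (word : value) (v : value) : value =
  let m = field_mask ~pos ~width in
  vor (vand word (Vint (lnot m))) (vand (vshl v (Vint pos)) (Vint m))

let bitfield_read ~pos ~width (word : value) : value =
  vand (vshr word (Vint pos)) (Vint ((1 lsl width) - 1))

(* Field written into an initialized word, then read back. *)
let initialized =
  bitfield_read ~pos:4 ~width:3
    (bitfield_write ~pos:4 ~width:3 (Vint 0xFF) (Vint 5))

(* Same write, but into an uninitialized word. *)
let uninitialized =
  bitfield_read ~pos:4 ~width:3
    (bitfield_write ~pos:4 ~width:3 Vundef (Vint 5))
```

In this model, initialized is Vint 5 whereas uninitialized is Vundef, although the three bits that were written are perfectly determined; tracking definedness bit by bit instead of word by word would avoid this loss of information.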

CompCert-KVX’s test suite includes calling compiler fuzzers CSmith²⁸ and
YarpGen:²⁹ random programs are generated, compiled with gcc and CompCert-KVX,
and run on a simulated target; an error is flagged if final checksums diverge.

Due to possible semantic differences for the subset of the C language between 
the tools that they use for their formal proofs and CompCert, Gernot Heiser, 
lead designer of the seL4 verified kernel, argues that translation validation of the 
results of black-box compilation by gcc is a safer route: 


[...] using CompCert would not give us a complete proof chain. It uses 
a different logic to our Isabelle proofs, and we cannot be certain that its 
assumptions on C semantics are the same as of our Isabelle proofs. 


Another option, for C code produced from a higher-level language by code 
generators, is to replace CompCert’s frontend by a verified code generator for
that language, directly targeting one of CompCert’s intermediate representations 
(e.g., Clight) and semantics, as done for instance for Velus [13] for a subset of 
the Lustre synchronous programming language. 

Some features of the C programming language are not supported by CompCert’s
formally verified core, but can be supported through optional unverified
preprocessing, chosen by command-line options: -fstruct-passing allows
passing structures (and unions) by value as parameters to functions, as well as
returning them from a function;³⁰ -fbitfields allows bit fields in structures.³¹
Preprocessing implements these operations using lower-level constructs (memory
copy builtin, bit shift operators), sometimes in ways incompatible with other
compilers; CompCert’s manual details such incompatibilities.

In addition, option -finline-asm allows inline assembly code with param- 
eter passing, in a way compatible with gcc (implementing a subset of gcc’s 
parameter specification). The semantics of inline assembly code is defined as 
clobbering registers and memory as specified, and emitting an externally ob- 
servable event. Option -fall activates structure passing, bitfields, and inline 
assembly, for maximal compatibility with other compilers. 


28 https://github.com/csmith-project/csmith and [57]

29 https://github.com/intel/yarpgen

30 In C, passing pointers to structures that contain parameters or are meant to 
contain return values is a common idiom. The language however also allows 
passing or returning the structures themselves, and this is implemented in 
various ways by compilers, including passing pointers to temporary structures 
or, for structures small enough to fit within a (long) machine word, directly 
in an integer register. How to do so on a given platform is specified by the 
ABI. Parameter passing, with all particular cases, may be a quite delicate and 
convoluted part of the ABI. 

31 Recently, direct verified handling of bitfields was added to CompCert (commit 
d2595e3). This should be available in release 3.10. 
Because inline assembly is difficult to use,³² and because its semantics 
involves emitting an event, preventing many optimizations, CompCert also 
provides builtin functions that call specific processor instructions. If a 
builtin has been given an arithmetic semantics, then it can be compiled into 
arithmetic operators suitable for optimization; this is the case, for instance, 
of the “fused multiply add” operator on the KVX. In contrast, instructions that 
change special processor registers are defined to emit observable events. 


5 Assembly Back-End Issues 


The verified parts of CompCert do not output machine code, let alone textual 
assembly code. Instead, they construct a data structure describing a set of global 
definitions: variables and functions; a function contains a sequence of instructions 
and labels. The instructions at that level may be actual processor instructions, 
or pseudo-instructions, which are expanded by unverified OCaml into a sequence 
of actual processor instructions. The resulting program is printed to textual 
assembly code by the TargetPrinter module; most of it consists of printing the 
appropriate assembly mnemonic for each instruction, together with calling 
functions for printing addressing modes and register names correctly, but there 
is some arcane code dealing with proper loading of pointers to global symbols, 
printing of constant pools, etc. Some of this code depends on linking 
peculiarities and on the target operating system, not only on the target 
processor. 


5.1 Printing Issues 


An obvious source of potential problems is the huge “match” statement with 
one case per instruction, each mapping to a “print” statement. If the “print” 
statement is incorrect, then the instruction printed will not correspond to the one 
in the data structure. Printing an ill-formed instruction is not a serious problem, 
as the assembler will refuse it and compilation will fail. There have however been 
recent cases where CompCert printed well-formed text assembly instructions that 
did not correspond to the instruction in the data structure. The reason why 
such bugs were not caught earlier is that these instructions are rarely used. 
Commit 2ce5e496 fixed a bug resulting in some fused multiply-add instructions 
being printed with arguments in the wrong order; these instructions are selected 
only if the source code contains an explicit fused multiply-add builtin call, which 
is rare. In CompCert-KVX, commit e2618b31 fixed a bug in which “nand” instructions 
were printed as “and”; “nand” is selected only for the rare ~(a & b) pattern. 
The bug was found by compiling randomly generated programs. 
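
The kind of systematic printer test that would catch such a bug can be sketched as follows, in Python rather than OCaml for brevity. This is only an illustration under stated assumptions: the instruction names, the printer tables and the reference semantics are all hypothetical and do not reflect CompCert's actual TargetPrinter. The idea is to print each abstract instruction, interpret the printed mnemonic with an independent reference, and compare the result with the intended semantics on a few test vectors.

```python
# Toy model: abstract instructions, a printer, and an independent check.
# All names here are illustrative assumptions, not CompCert code.

MASK32 = 0xFFFFFFFF

# Intended semantics of each abstract instruction (the "data structure" side).
SEMANTICS = {
    "Pand":  lambda a, b: a & b,
    "Pnand": lambda a, b: ~(a & b) & MASK32,
    "Por":   lambda a, b: a | b,
}

# Printer table: abstract instruction -> assembly mnemonic.
GOOD_PRINTER = {"Pand": "and", "Pnand": "nand", "Por": "or"}
# A printer with the kind of bug described above: "nand" printed as "and".
BUGGY_PRINTER = {"Pand": "and", "Pnand": "and", "Por": "or"}

# Independent reference for what each mnemonic does on the (toy) processor.
MNEMONIC_SEMANTICS = {
    "and":  lambda a, b: a & b,
    "nand": lambda a, b: ~(a & b) & MASK32,
    "or":   lambda a, b: a | b,
}

TEST_VECTORS = [(0, 0), (0xFFFFFFFF, 0x0F0F0F0F), (0x12345678, 0xFFFF0000)]


def mismatches(printer):
    """Return the abstract instructions whose printed form behaves differently."""
    bad = []
    for instr, intended in SEMANTICS.items():
        printed = MNEMONIC_SEMANTICS[printer[instr]]
        if any(intended(a, b) != printed(a, b) for a, b in TEST_VECTORS):
            bad.append(instr)
    return bad
```

Because the check iterates over every entry of the instruction table, rarely selected instructions are exercised as often as common ones, which is precisely what ordinary compilation of test programs does not guarantee.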

32 Inline assembly is so error-prone that specialized tools have been designed to check 
that pieces of assembly code match their read/write/clobber specification [46]. 

In some early versions of CompCert there used to be a code generation bug [57, 
§3.1] that resulted in an exceedingly large offset being used in relative 
addressing on the PowerPC architecture; this offset was rejected by the 
assembler. Similar issues surfaced later in CakeML on the MIPS-64 architecture 
[20] and in CompCert on AArch64 (commit c8ccecc). This is a sign that 
constraints on immediate operand sizes are easily forgotten or mishandled,³³ 
and a caution: incorrect value sizes could also produce code that the assembler 
accepts without error. 
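
The constraint at stake amounts to a simple range check. The sketch below, in Python, tests whether an offset fits in a signed immediate field of a given width and, for scaled encodings, whether it is a multiple of the access size. The function names and parameters are illustrative assumptions and are not taken from CompCert; the field widths used in any call are up to the caller. The second function shows why a missing check can be silent: truncating to the field width yields a well-formed but different value.

```python
def fits_signed_immediate(offset: int, bits: int, scale: int = 1) -> bool:
    """True if `offset` is encodable as a signed `bits`-bit immediate
    that the hardware multiplies by `scale` (the access size in bytes)."""
    if offset % scale != 0:
        return False          # misaligned offsets cannot be encoded
    scaled = offset // scale
    return -(1 << (bits - 1)) <= scaled < (1 << (bits - 1))


def truncate_signed(value: int, bits: int) -> int:
    """What a field of `bits` bits would hold if no range check were done."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value
```

For example, a 7-bit field scaled by 8 covers offsets from -512 to 504 in steps of 8, a much smaller range than an unscaled 16-bit field; a code generator that reuses the wider bound for the narrower encoding either gets rejected by the assembler or, if the value is silently truncated, produces a wrong address.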


5.2 Pseudo-Instructions 


In addition to instructions corresponding to actual assembly instructions, the 
assembler abstract syntax in CompCert features pseudo-instructions, or macro- 
instructions, most notably: allocation and deallocation of a stack frame; copying 
a memory block of a statically known size; jumping through a table. The rea- 
sons why these are expanded in unverified OCaml code are twofold. First, the 
correspondence between the semantics of such operations and their decomposi- 
tion cannot be easily expressed within CompCert’s framework for assembly-level 
small-step semantics, especially the memory model. CompCert models memory 
as a set of distinct blocks, and pointers as pairs (block identifier, offset within 
the block);³⁴ stack allocation and deallocation create or remove memory blocks 
by moving the stack pointer, which is just a positive integer. Jump tables (used 
for compiling certain switch statements) are arrays of pointers to instructions 
within the current function, whereas CompCert only knows about function point- 
ers. Second, their expansion may use special instructions (load/store of multiple 
registers, hardware loops...) not normally selected, the behavior of which may 
be difficult to express in the semantics³⁵ or the memory model. This is typically 
the case for memory copy; see below. 
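
A toy model of the block-based view of memory described above can be written in a few lines of Python. It is only a sketch under simplifying assumptions (byte-granular blocks, no permissions, hypothetical class and function names) and is not CompCert's actual memory model. It illustrates that a pointer is a (block, offset) pair, that blocks are unrelated to each other, and that freeing a block invalidates it instead of moving an integer stack pointer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    block: int    # block identifier
    offset: int   # byte offset within the block


class Memory:
    def __init__(self):
        self._blocks = {}     # block id -> bytearray
        self._next_id = 1

    def alloc(self, size: int) -> Pointer:
        """Create a fresh block, disjoint from all existing ones."""
        block = self._next_id
        self._next_id += 1
        self._blocks[block] = bytearray(size)
        return Pointer(block, 0)

    def free(self, ptr: Pointer) -> None:
        """Remove a block; later accesses through it are invalid."""
        del self._blocks[ptr.block]

    def valid(self, ptr: Pointer) -> bool:
        data = self._blocks.get(ptr.block)
        return data is not None and 0 <= ptr.offset < len(data)

    def store(self, ptr: Pointer, byte: int) -> None:
        if not self.valid(ptr):
            raise ValueError("invalid access")
        self._blocks[ptr.block][ptr.offset] = byte

    def load(self, ptr: Pointer) -> int:
        if not self.valid(ptr):
            raise ValueError("invalid access")
        return self._blocks[ptr.block][ptr.offset]


def add(ptr: Pointer, delta: int) -> Pointer:
    """Pointer arithmetic changes the offset only, never the block."""
    return Pointer(ptr.block, ptr.offset + delta)


def less_than(p: Pointer, q: Pointer):
    """Ordering is defined only within one block; None models 'undefined'."""
    return p.offset < q.offset if p.block == q.block else None
```

In such a model there is no way to reach one block from another by arithmetic, which is why an expansion that adjusts a flat stack pointer, or a jump table holding addresses inside a function, has no direct counterpart in it.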


Stack Frame (De)Allocation. Stack (de)allocation pseudo-instructions address 
the gap between the abstract representation of the memory as a set of blocks 
completely separated from each other and the flat addressing space implemented 
by most processors, where call frames are laid out consecutively and allocation 
and deallocation amount to subtracting from or adding to the stack pointer. A 
refined view, with a correctness proof going to the flat addressing level, was 
proposed for the x86 target [55] but not merged into mainline CompCert. 


33 For instance, CompCert-KVX generates loads and stores of register pairs on AArch64, 
with special care: their offset range is smaller than for ordinary loads and stores. 
34 This reflects the C standard’s view that variables and blocks live each in their own 
separate memory space. For instance, in C, comparisons between pointers to dis- 
tinct variables have undefined behavior [4, §6.5.8]. Some CompCert versions in which 
pointers truly are considered to be integers have been proposed [7,43]. 

35 Hardware loops, on processors such as the KVX, involve special registers. When the 
program counter equals the “loop exit” register, and there remain loop iterations to 
be done, control is transferred to the location specified by the “loop start” register. 
In all existing CompCert assembly language semantics, non-branching instructions 
go to the next instruction. Modeling hardware loops would thus involve changing 
all instruction semantics to transfer control according to whether the loop exit is 
reached, proving invariants regarding the hardware loop registers, etc. This could be 
worth it if the hardware loops could be selected for regular code, not just builtins, 
but this itself would entail considerable changes in previous compiler phases. 
Loading Constants. Certain instructions may need some expansion and case 
analysis, and possibly auxiliary tables. For instance, on the ARM architecture, 
long constants must be loaded from constant pools addressed relative to the 
program counter; thus emitting a constant load instruction entails emitting a 
load and populating the constant pool, which must be flushed regularly since 
the range of addressing offsets is small. Getting the address of a global or 
local symbol (global or static variable) may also entail multiple instructions, 
and perhaps a case analysis depending on whether the code is to be 
position-independent, and, in CompCert-KVX, whether the symbol resides in a 
thread-local program section.³⁶ The low-level workings of the implementation of 
these pseudo-instructions rely on the linker performing relocations, on the 
application binary interface specifying that certain registers point to certain 
memory sections, etc. 


Builtins. CompCert allows the user to call special “builtins”, dealing mainly with 
special machine registers and instructions (memory barriers, etc.). These builtins 
are expanded in Asmexpand or TargetPrinter into actual assembly instructions. 

As an example, consider the memory copy builtin, which may be used by the user 
(with __builtin_memcpy_aligned()) to request copying a memory block of known 
size, and is also issued by the compiler for copying structures. Expanding that 
builtin may go through a case analysis on block size and alignment: smaller 
blocks will be copied by a sequence of loads and stores, larger blocks using a 
loop. The scratch registers may be different in each case, and this case 
analysis must be replicated in the specification; alternatively, the 
specification may contain an upper bound on the set of clobbered registers, but 
in any case no clobbered register should be forgotten. There may also be a 
complicated distinction of cases regarding which source register aliases which 
other source register or which scratch register. A bug in that builtin, which 
did not check alignment and generated improper offsets for load instructions, 
was found in CompCert on AArch64; the assembler would reject the generated code 
(commit c8ccecc). Another bug in the same builtin, on four architectures (ARM, 
AArch64, PowerPC, RISC-V), due to an incorrect test about register aliasing, 
resulted in successful compilation, assembly and linking with incorrect code 
being emitted (commit c2c871c). 
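
The shape of such an expansion, and of a test relating it to its specification, can be sketched as follows. This is a toy written in Python, not CompCert's OCaml expansion code: the register names, the pseudo-assembly syntax, the unrolling threshold and the clobber sets are all assumptions made for illustration. The expansion returns both the instruction sequence and the scratch registers it uses, and the check verifies that every case stays within the upper bound declared on the specification side.

```python
# Hypothetical register names and thresholds; not CompCert's expansion code.
SCRATCH_SMALL = {"t0"}
SCRATCH_LOOP = {"t0", "t1", "t2"}
# Specification side: an upper bound on what the builtin may clobber.
SPEC_CLOBBERED = {"t0", "t1", "t2"}
UNROLL_LIMIT = 32   # bytes; above this a loop is emitted


def expand_memcpy(size: int, align: int):
    """Return (instructions, clobbered registers) for copying `size` bytes
    from the address in `src` to the address in `dst`, both aligned on `align`."""
    chunk = max(c for c in (1, 2, 4, 8) if align % c == 0)
    if size <= UNROLL_LIMIT:
        code, off = [], 0
        while off < size:
            step = chunk
            while step > size - off:   # tail smaller than a full chunk
                step //= 2
            code.append(f"load{step}  t0, {off}(src)")
            code.append(f"store{step} t0, {off}(dst)")
            off += step
        return code, set(SCRATCH_SMALL)
    code = [
        f"move  t1, {size // chunk}",      # loop counter
        "move  t2, 0",                     # running offset
        "loop:",
        f"load{chunk}  t0, t2(src)",
        f"store{chunk} t0, t2(dst)",
        f"add   t2, t2, {chunk}",
        "sub   t1, t1, 1",
        "bnez  t1, loop",
    ]
    # Copy the remaining bytes one at a time.
    for off in range(size - size % chunk, size):
        code.append(f"load1  t0, {off}(src)")
        code.append(f"store1 t0, {off}(dst)")
    return code, set(SCRATCH_LOOP)


def check_spec(sizes, aligns):
    """Every case of the expansion must stay within the specified clobbers."""
    return all(expand_memcpy(s, a)[1] <= SPEC_CLOBBERED
               for s in sizes for a in aligns)
```

A call such as `check_spec(range(1, 100), (1, 2, 4, 8))` exercises every branch of the case analysis, including the unrolled tail and the loop, which ordinary test programs may never reach.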

One bug was found in the CompCert-KVX stack frame allocation code, which 
had no adverse consequence unless a very large stack frame or many parameters 
were used, which explains why it was not detected earlier (commit fccfa9). 


Clobbered Registers. Expansions of pseudo-instructions and builtins often use 
scratch registers. The registers that are clobbered by each pseudo-instruction 
and builtin are defined in the Coq file (Asm.v) giving the semantics of the 
abstract assembly language. Thus, changes to expansions must be applied 
consistently to both the Asm.v specification and the Asmexpand and/or 
TargetPrinter OCaml modules. 


36 In C11 [4], the _Thread_local storage class specifies that one separate copy of the 
variable exists for each thread. Typically, a processor register points to the thread- 
local memory area and these variables are accessed by offsets from that register. 
CompCert has no notion of concurrency, but on the KVX, some system variables are 
thread-local and must be accessed as such even from single-threaded programs. 
In the last few years, several specification bugs about registers clobbered by 
pseudo-instructions and builtins were found in CompCert, on several 
architectures. Commit 0df99dc4 fixes several wrong specifications of clobbered 
registers on AArch64; commit a4cfb9c2 on ARM; commit 39710f78 on RISC-V. It seems 
that none of these bugs could result in the generation of incorrect code, for 
the registers that were wrongly specified not to be clobbered were not used by 
the CompCert code generator to store persistent data. The problem is that it 
was possible to modify the code generator with full correctness proof, and have 
CompCert generate incorrect code. For instance, some pseudo-instructions would 
use the return address register as a scratch register, not specified as clobbered. 
Some compilers perform leaf function optimization: the prologues and epilogues 
of functions that never call other functions do not save and restore the return 
address. CompCert applies this optimization only on the PowerPC architecture, 
and even then only partially; if one had added this optimization to AArch64 or 
RISC-V, incorrect code would be generated in leaf functions using the wrongly 
specified pseudo-instructions, though all proofs would go through. 

Bugs in expansion of builtins due to incorrect specification of clobbered reg- 
isters (or memory), and those related to outcome depending on compiler choices 
(e.g., register aliases), eerily resemble those due to improper use of inline assem- 
bly in C programs [46]. Perhaps similar methods of validation could be used. 

As an alternative, we propose moving the parts that deal with case distinc- 
tions (register aliasing, sizes, alignments...) out of the untrusted code base into 
the trusted code base, possibly one pseudo-assembly instruction for each case. 
For instance, there could be one “memory copy” pseudo-assembly instruction 
for each different code sequence to be generated, with fixed “clobbered” regis- 
ters and explicit constraints on alignment, size etc. in the specification of the 
instruction. Verified Coq code would select the proper pseudo-instruction to use. 
This would likely avoid bugs due to case distinctions in trusted code, alleviate 
difficulties in properly specifying the pseudo-instructions and keeping this spec- 
ification synchronized with their expansion, and make it easier to perform unit 
testing on the expansions. 


5.3 Microarchitectural Concerns 


CompCert-KVX introduced instruction scheduling to CompCert.³⁷ Instruction 
scheduling reorders instructions while preserving semantics so as to minimize 
execution time. Current high-performance processors dynamically reorder in- 
structions, but this is complex and consumes extra energy; in-order processors 
need the compiler to schedule instructions for good performance, taking into 
account latencies (the number of clock cycles between the operands of an in- 
struction being read and the results being produced) and resource constraints 
(the number of instructions that can be simultaneously executed; e.g., a proces- 
sor may be able to execute two instructions at a time, but only one of them may 
be a memory access, and only one of them may be floating-point). 


37 Tristan & Leroy [54] had developed scheduling for CompCert but their developments 
were not made publicly available, let alone integrated into CompCert releases. 
Tables of resource uses and latencies are cumbersome to build, and often 
involve access to private documentation and/or reverse engineering; they are 
thus likely incorrect.³⁸ Fortunately, all targets of CompCert-KVX have interlocked 
pipelines, meaning that, if a value is read from a register that awaits a write, 
the instruction is stalled; thus sequential semantics are preserved: the worst 
that can happen if incorrect latencies are used is that the pipeline stalls for 
some cycles, which is a performance, not a correctness, issue. In contrast, on 
processors with non-interlocked pipelines the latencies belong to the semantic 
definition of the assembly code: a read from a register that awaits a write yields 
the previous value held in that register. Regarding resource constraints, on a very 
large instruction word (VLIW) processor, bundles of instructions that exceed 
resource constraints will be refused by the assembler; on a conventional multiple- 
issue processor, successive instructions that cannot be issued at the same cycle 
for lack of resources will be issued sequentially, which is equivalent since the 
processor preserves sequential semantics even when issuing several instructions. 
We conclude that pipeline modeling issues have no impact on the correctness of 
the generated code of CompCert-KVX, but solely on its performance.³⁹ 
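
The difference between the two kinds of pipelines can be made concrete with a toy simulation. The Python sketch below is an assumption-laden illustration, not a model of any real processor: it issues at most one instruction per cycle, ignores resource constraints and write-after-write hazards, and treats uninitialized registers as 0. With interlocking, a schedule built from a wrong latency only costs cycles; without it, the same schedule reads a stale value.

```python
def run(program, interlocked: bool):
    """Execute `program`, a list of (dest, op, sources, latency) tuples,
    on a toy in-order pipeline issuing at most one instruction per cycle.
    Returns (final register values, number of cycles)."""
    regs = {}                 # architecturally visible register values
    pending = []              # in-flight writes: (ready_cycle, dest, value)
    cycle = 0

    def commit(now):
        # Make visible every write whose latency has elapsed.
        for item in sorted(p for p in pending if p[0] <= now):
            regs[item[1]] = item[2]
            pending.remove(item)

    for dest, op, sources, latency in program:
        commit(cycle)
        if interlocked:
            # Stall while any source register still awaits a write.
            while any(d in sources for _, d, _ in pending):
                cycle += 1
                commit(cycle)
        values = [regs.get(s, 0) for s in sources]
        pending.append((cycle + latency, dest, op(*values)))
        cycle += 1
    while pending:            # drain the pipeline
        commit(cycle)
        if pending:
            cycle += 1
    return regs, cycle


# A schedule that wrongly assumes the multiplication has latency 1.
PROGRAM = [
    ("r1", lambda: 5, (), 1),                 # r1 := 5
    ("r2", lambda a: a * 3, ("r1",), 3),      # r2 := r1 * 3, latency 3
    ("r3", lambda a: a + 1, ("r2",), 1),      # r3 := r2 + 1, issued too early
]
```

Running `run(PROGRAM, True)` stalls the third instruction until `r2` is available and computes `r3` from the new value of `r2`, whereas `run(PROGRAM, False)` finishes sooner but computes `r3` from the previous content of `r2`.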


5.4 Assembling and Linking 


CompCert produces assembly code in textual form, which must then be assembled 
and linked using another toolchain, such as gcc (the GNU Compiler Collection) 
or clang (LLVM). This toolchain is thus within the TCB. AbsInt GmbH, which 
sells the commercial releases of CompCert, also sells for certain architectures 
the Valex tool which matches the CompCert code to the binary code [37,27]. An 
alternative is direct generation of machine code, as in CakeML [31]; CompCertELF 
extends CompCert with a verified assembler for the x86 target [56]. 

Finally, CompCert’s correctness proof was originally meant for a “closed 
world”: a program wholly compiled with it as a single module. In reality, most 
large C projects are compiled from multiple files which are then linked. The 
correctness proof was later extended, in version 2.7, to account for separate 
compilation and linking, following [26]. There have been proposals for more 
ambitious formalizations of the linking process [50], even implementing a 
verified linker for a subset of ELF on the x86-32 architecture [56].⁴⁰ 
Specifying and proving correct a general ELF linker is itself a fairly 
ambitious project [28]. 


6 Modeling and Application Binary Interface Issues 


The semantics of assembly instructions is defined, for each architecture, in the 
official manuals from the architecture designers. The application binary 
interface (ABI), specific to each combination of architecture and operating 
system (or execution environment), defines how parameters are to be passed (in 
which registers, etc.), what kind of different global symbols exist and how they 
are accessed, what registers are reserved for system use, how the execution 
stack is to be laid out, what values the high-order bits of long registers may 
contain if the register contains a shorter value, etc. In contrast, CompCert’s 
view of values is somewhat abstract, even at the assembly level, which may pose 
problems especially when interfacing to other parts of the runtime system. 

38 The CompCert-KVX team had private documentation on the KVX; despite that, due 
to the tedium of building tables, they had a few bugs, as shown by commit logs. Their 
tables for AArch64 and RISC-V are based on the source code of other compilers. 

39 The situation would of course be very different in the case of a tool bounding worst 
case execution time through precise processor modeling. 

40 ELF is a standard file format for object code. 


6.1 Modeling of Values 


CompCert considers that a value, e.g., stored in a register, is either a 32-bit 
integer; a 64-bit integer; a 32-bit single precision floating-point number; a 64-bit 
double precision floating-point number; a pointer, consisting of a block identifier 
and an offset; or “undefined”, a value that can be refined into any other value, 
modeling undefined behavior that does not stop program execution (because not 
yet externally observed). This is, however, an abstraction of reality. Pointers, in 
reality, are not a pair (block, offset) but a single 32-bit or 64-bit integer. How is 
a 32-bit value stored in a 64-bit register? Are the higher-order bits indifferent, 
supposed to be 0 (0-extension) or equal to the sign bit (sign-extension)? 

These modeling issues have subtle consequences on the implementation of cer- 
tain instructions. If the application binary interface specifies that 32-bit values 
stored in 64-bit processor registers are 0-extended, then the 0-extension opera- 
tion as defined in CompCert (taking a 32-bit unsigned value and returning the 
same value as a 64-bit unsigned integer) can be implemented as a no-operation 
at assembly level (with the special annotation, for the register allocator, that 
the target register should be the same as the source register).⁴¹ Similarly, if 
the application binary interface specifies that 32-bit values stored in 64-bit pro- 
cessor registers are sign-extended, then the sign-extension operation as defined 
in CompCert can be implemented as a no-operation at assembly level. Finally, 
the application binary interface may specify that the higher 32 bits of a 64-bit 
register containing a 32-bit value are arbitrary. 

Since none of the CompCert semantics specifies register contents at the bit 
level, it is up to the backend designer to be consistent in what instructions 
assume and ensure, and this consistency is never formally verified. Consistency 
must extend to the foreign function interface: for instance, if a CompCert function 
is called from a function compiled with another compiler that considers that the 
higher order 32 bits contain arbitrary values, but CompCert assumes that values 
are 0-extended, then incorrect behavior may ensue. 
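
The conventions discussed above, and the mismatch at a foreign function boundary, can be illustrated at the bit level. The following Python sketch models a 64-bit register as a non-negative integer; the function names and the callee scenario are hypothetical and only serve to show how an extension compiled to a no-operation goes wrong when the caller follows a different convention.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def zero_extend32(reg: int) -> int:
    """64-bit register content after zero-extending its low 32 bits."""
    return reg & MASK32


def sign_extend32(reg: int) -> int:
    """64-bit register content after sign-extending its low 32 bits."""
    low = reg & MASK32
    return (low | ~MASK32) & MASK64 if low & 0x80000000 else low


def callee_index(reg: int, trust_zero_extended: bool) -> int:
    """A callee that uses a 32-bit unsigned argument as a 64-bit array index.
    If it trusts the convention, the extension is compiled to a no-op."""
    return reg if trust_zero_extended else zero_extend32(reg)
```

If the caller leaves arbitrary bits in the upper half, for instance `0xDEADBEEF00000007` for the 32-bit value 7, a callee built with `trust_zero_extended=True` computes a wildly wrong index, while one that re-extends its argument obtains 7 in both cases.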

41 This also explains why on some platforms, the code produced by CompCert contains 
useless moves. If a 32-bit value needs to be extended to 64 bits in a way that both 
the 32-bit and 64-bit version are live after extension, then these two values, even 
if they are implemented by the same bit-string, will have to reside in two different 
registers, since CompCert value semantics distinguishes 32-bit from 64-bit values. 

The modeling of certain instructions is delicate. The KVX processor supports, 
in addition to normal loads from memory, speculative loads, otherwise known as 
non-trapping or dismissible loads. A normal load from an incorrect memory 
address will trap; on the KVX, a speculative load from an incorrect address 
returns 0 instead of trapping. Here, “incorrect” is meant with respect to the 
page tables of the processor. In the intermediate representations of 
CompCert-KVX, speculative loads from incorrect memory locations return the 
special value “undefined”, whereas a normal load would terminate execution. 
“Undefined” is a form of “poison value” propagating through operations, e.g., 
adding it to an integer yields “undefined”. The assembly-level semantics, 
however, defined the value returned by a speculative load from an incorrect 
memory location as 0, as per the processor documentation. 0 is a valid 
refinement of “undefined”, and the proofs go through. This is however incorrect 
modeling, because it conflates two different notions: memory accesses invalid 
with respect to CompCert semantics, and memory accesses invalid with respect to 
the processor memory management unit.⁴² The accesses that are valid for 
CompCert are strictly included in those that are valid for the processor:⁴³ a 
valid CompCert memory block may occupy a portion of a valid memory page, but 
the processor will allow accesses to the whole page. Using this incorrect 
semantics, one could perform a speculative load from a location known to be 
incorrect with respect to CompCert semantics (for instance, just past the end of 
a block allocated on the stack) and assume that this load would return 0, 
whereas this location, when read, would return another value. Commit 5798f56b 
replaced this default value by “undefined”, which is correct: any value is a 
valid refinement of “undefined”. 
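
The flaw can be reproduced on a toy machine. The Python sketch below assumes a single mapped page containing a single CompCert-level block; the class, its methods and the layout are hypothetical and merely mirror the argument above. The hardware returns 0 only when the page tables reject the access, so a semantics that promises 0 for every access outside the block is not refined by the hardware, whereas a semantics that returns “undefined” is.

```python
PAGE_SIZE = 4096
UNDEF = object()   # stands for CompCert's "undefined" value


class ToyMachine:
    """One mapped page containing one CompCert-level block (hypothetical layout)."""

    def __init__(self, block_start: int, block_size: int):
        self.page = bytearray(PAGE_SIZE)        # page mapped at address 0
        self.block = range(block_start, block_start + block_size)

    def valid_for_compcert(self, addr: int) -> bool:
        return addr in self.block

    def valid_for_mmu(self, addr: int) -> bool:
        return 0 <= addr < PAGE_SIZE

    def hardware_speculative_load(self, addr: int) -> int:
        """Processor behaviour: 0 only when the page tables reject the access."""
        return self.page[addr] if self.valid_for_mmu(addr) else 0

    def semantics_old(self, addr: int):
        """Flawed model: 0 whenever CompCert considers the access invalid."""
        return self.page[addr] if self.valid_for_compcert(addr) else 0

    def semantics_fixed(self, addr: int):
        """Corrected model: an invalid access yields 'undefined'."""
        return self.page[addr] if self.valid_for_compcert(addr) else UNDEF


def refines(concrete, abstract) -> bool:
    """Any value refines 'undefined'; otherwise values must be equal."""
    return abstract is UNDEF or concrete == abstract
```

Storing a nonzero byte just past the end of the block, on the same page, is enough to exhibit an address where the hardware result does not refine `semantics_old` but does refine `semantics_fixed`.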


6.2 Foreign Function Interface 


CompCert’s application binary interface (ABI) is not specified in a single place 
in CompCert: it comprises the calling convention, the value conventions implicit 
in the choice of instructions, etc. The correctness theorem of CompCert relates 
the execution of a C program, started from the main function, to the execution 
of the assembly program produced by its compilation, also started from the main 
function. It does not discuss functions compiled with other compilers calling a 
function compiled using CompCert. It also assumes that functions called from 
code compiled by CompCert use the same calling convention. As explained in 
CompCert’s manual: 


CompCert attempts to generate object code that respects the Application 
Binary Interface of the target platform and that can, therefore, be linked 
with object code and libraries compiled by other C compilers. 


The manual then describes areas where CompCert’s ABI differs from those of 
other compilers on the targets that it supports. Again, none of these other ABIs 
were formalized, so the statement of differences in the manual is not based on 
formal analysis of compatibility, but rather on human analysis. 


42 Or, rather, the association of the processor memory management unit and the virtual 
memory subsystem of the operating system. 

43 In the case of memory over-commit by the OS, a valid memory access with respect 
to CompCert semantics may result in a segmentation violation. We do not consider 
this issue here, since it is a case of the OS promising resources to the program then 
reneging on its promises, and thus not supplying a stable execution environment. 


The Trusted Computing Base of the CompCert Verified Compiler 227 


6.3 Runtime System 


The runtime system for C is rather limited compared to other languages. It 
uses the C standard library supplied by the target platform. CompCert makes 
no assumption about it—calls to the standard library are just calls to external 
functions, and the sequence of these calls, as observable events, in the source se- 
mantics is reflected in the assembly code—except for the heap memory allocation 
and deallocation functions malloc() and free(), which have special treatment 
and are given specific semantics (creation and destruction of memory blocks in 
the CompCert memory model). CompCert assumes that this allocator is correct 
with respect to CompCert’s infinite memory model. In particular, CompCert as- 
sumes that malloc always succeeds and never returns the null pointer, which 
seems unsound: in theory, some formally verified optimizations may incorrectly 
remove defensive checks against heap overflow. In practice, we do not know of 
any optimization in CompCert exploiting this model of malloc. This assumption 
of infinite memory has been removed in CompCertS [7], at the price of a large 
extension of CompCert. 

In CompCert, basic floating-point operations have a semantics defined ac- 
cording to IEEE-754 in round-to-nearest mode. This assumes no change to the 
rounding mode through a library call or direct access to special CPU registers. 

Some processors do not support some expensive arithmetic operations (e.g. 
floating-point operations, division) in hardware. These are replaced by calls to 
functions in the runtime system, which are axiomatized to perform the required 
operation by a combination of elementary instructions. This creates a somewhat 
paradoxical situation where, for the same operation (say, 32-bit integer division): 
(i) if the operation is implemented in hardware, then it is trusted; (ii) if imple- 
mented in software through a call to the runtime system, then it is trusted; (iii) if 
implemented in software through expansion inside CompCert, then one has to 
provide a full proof that this expansion implements the operation: its execution 
coincides with that of the operator on argument values on which this operator 
has defined behavior. One argument is that the hardware is likely to have been 
designed from existing floating-point designs and thoroughly tested with many 
test vectors.⁴⁴ Software emulation is likely to be from a well-tested 
established library,⁴⁵ whereas expansion in CompCert probably has not been 
tested so well. 


7 Insights and Conclusion 


Some natural questions about “verified” software are: how truly safe is it? What 
kinds of constructs should be considered suspicious? As more designs come 
with some formal proofs of correctness, even regulatory agencies have had to 
provide guidelines [1]. It is of course perilous to draw general conclusions from 
the analysis of one single project; here are some insights. 

44 E.g. the Berkeley hard float library (https://github.com/ucb-bar/berkeley-hardfloat) is 
used in certain RISC-V designs. Yet, they remind potential users that “These units 
are works in progress. They may not be yet completely free of bugs [...]”. 

45 E.g. the Berkeley soft float library (http://www.jhauser.us/arithmetic/SoftFloat.html); 
but, again “Releases 3 through 3c of Berkeley SoftFloat contain bugs in the square 
root functions that may be of concern for some uses. Those bugs are believed to be 
repaired in Release 3d and later.” 

None of the problems found were in the verified parts of CompCert: chances 
seem slim to stumble into a proof checker bug by accident, not notice something 
is amiss, and think to have proved a theorem that actually does not hold. This 
explains why the number of bugs found in CompCert releases is many orders of 
magnitude below that of usual compilers [52]. By construction, the bugs of CompCert 
are located in a limited subpart of the software, called its TCB, which may 
however not be as small as we may naively expect. 

Two bugs were found in the front-end elaboration rules, “corner cases” that 
should be rarely found in real programs (thus their late discovery). A few sub- 
tle semantic bugs were also found in some back-ends. However, most bugs were 
found in the very last part of the back-end, which expands and prints assembly 
instructions. The causes of these bugs are: (i) the tedium of writing correct print- 
ers for each instruction with appropriate operand ordering, and the lack of sys- 
tematic unit testing of the printers; (ii) the number of different cases, especially 
in the choice of register arguments, in the expansion of pseudo-instructions, and 
again the lack of systematic testing that all cases are correct; (iii) the difficulties 
in keeping synchronized the specification of the pseudo-assembly instructions (in 
Coq) and the code performing their expansion, in two different files. All these 
seem to be common software engineering issues, amenable to standard software 
engineering solutions such as systematic testing of all cases. 

All these issues pertain to the specification and trusted (but unverified) parts 
of the CompCert back-end, which echoes the results of early experiments that 
found bugs in these parts [57]. In contrast, no bugs due to the use of axioms for 
interfacing untrusted code, or the use of the extractor to OCaml, were found. 
In academic circles, however, much attention is often given to doing away with 
such axioms and the extractor; this may not reflect the most pressing needs. 
There seems to be a chasm between, on the one hand, what feels relevant and 
interesting to experts in proof assistants or type theory and, on the other 
hand, what would actually increase reliability in verified compilers or similar 
tools. 

In our opinion, the primary focus for increasing trust in CompCert (and re- 
moving possible further bugs) should be a validation mechanism of its assembly 
and ABI specification with respect to the actual execution platform. For ex- 
ample, SAIL provides a formal ISA semantics for ARMv8 that has been tested 
against the ARM Architecture Validation Suite [5]. However, CompCert cannot 
be directly plugged into SAIL, because of its more abstract view of the ISA. And 
this would not solve the issues related to the runtime environment and the ABI. 
Abstract. The rise of persistent memory is disrupting computing to 
its core. Our work aims to help programmers navigate this brave new 
world by providing a program logic for reasoning about x86 code that 
uses low-level operations such as memory accesses and fences, as well as 
persistency primitives such as flushes. Our logic, PIEROGI, benefits from a 
simple underlying operational semantics based on views, is able to handle 
optimised flush operations, and is mechanised in the Isabelle/HOL proof 
assistant. We detail the proof rules of PIEROGI and prove them sound. 
We also show how PIEROGI can be used to reason about a range of 
challenging single- and multi-threaded persistent programs. 


Keywords: Persistent memory, x86-TSO, Owicki-Gries, Isabelle/HOL, verifi- 
cation 


1 Introduction 


In our era of big data, the long-established boundary between ‘memory’ and 
‘storage’ is increasingly blurred. Persistent memory is a technology that sits in 
both camps, promising both the durability of disks and data access times similar 
to those of DRAM. Embracing this technology requires rethinking our decades- 
old programming paradigms. As data held in memory is no longer wiped after a 
system restart, there is an opportunity to write persistent programs — programs 
that can recover their progress and continue computing even after a crash. 
However, writing persistent programs is extremely challenging, as it requires 
the programmer to keep track of which memory writes have become persistent, 
and which have not. This is further complicated in a multi-threaded setting by
the intricate interplay between the rules of memory persistency (which determine 
the order in which writes become persistent) and those of memory consistency 
(which determine what data can be observed by which threads). 

To address this difficulty, we provide a foundation for persistent program- 
ming. We develop a program logic, PIEROGI, for reasoning about x86 code that 
uses low-level operations such as memory accesses and fences, as well as per- 
sistency primitives such as flushes. We demonstrate the utility of PIEROGI by 
using it to reason about a range of challenging single- and multi-threaded per- 
sistent programs, including some that demonstrate the subtle interplay between 
optimised flush (flushopt) and store fence (sfence) instructions. Using the
Isabelle/HOL proof assistant, we have mechanised the PIEROGI rules and proved
them sound with respect to an operational semantics for x86 persistency [9]. One 
benefit of our Isabelle/HOL formalisation is that PIEROGI is already partially au- 
tomated: once the user has produced a proof outline (i.e. annotated each instruc- 
tion with a postcondition), they can simply use Isabelle/HOL’s sledgehammer, 
which automatically decides which axioms and rules of the proof system need 
invoking to verify the whole program. Our mechanisation, which includes all the 
example programs discussed in this paper, is available as auxiliary material [4,5]. 
State of the art To our knowledge, the only program logic for persistent 
programs is POG (Persistent Owicki-Gries) [31]. As with PIEROGI, POG enables
reasoning about persistent x86 programs and is based on the Owicki-Gries
method [30]. However, unlike PIEROGI, POG is not mechanised in a proof assistant,
and does not support optimised flush (flushopt) instructions. Optimised
flush instructions are an important persistency primitive as they are considerably 
faster than ordinary flush instructions. Indeed, Intel’s experiments on their Sky- 
lake microarchitecture indicate that they can be nine times faster when applied 
to buffers that hold tens of kilobytes of data [19, p. 289], and hence programmers 
are impelled, “If flushopt is available, use flushopt over flush.” However, flushopt 
is a tricky instruction for programmers and program logic designers alike: compared
to flush, flushopt can be reordered with more instructions under x86.

PIEROGI can reason efficiently about x86 persistency (including flushopt
instructions) thanks to two key recent advances: 1) Px86view [9], the view-based
operational semantics of x86 persistency; and 2) the C11 Owicki-Gries logic [11-13]
to reason about view-based operational semantics, which we adapt to Px86view.


Our contributions 1) We present a program logic, called PIEROGI, for reason- 
ing about persistent x86 programs. 2) We mechanise (and partially automate) 
PIEROGI in Isabelle/HOL, and prove it sound relative to an established opera- 
tional semantics for x86 persistency. 3) We demonstrate the utility of PIEROGI 
by using it to verify several idiomatic persistent x86 programs. 


Outline We begin with an overview of memory consistency and persistency 
in x86 and provide an example-driven account of PIEROGI reasoning (§2). We 
describe the assertion language and proof rules of PIEROGI in §3, and verify a se- 
lection of programs using PIEROGI in §4. We present the view-based operational 
semantics of x86 persistency and prove the soundness of PIEROGI in §5.
Auxiliary material Additional examples as well as the proofs of theorems
stated in the paper are given in the accompanying technical appendix [5]. Our 
Isabelle/HOL mechanisation is available as auxiliary material [4]. 


2 Overview and Motivation 


Recent operational models for weak memory use views to capture relaxed be- 
haviours of concurrent programs [9, 11,21, 22], where the memory records the 
entire history of writes that have taken place thus far. This way, different threads 
can have different subsets of these writes (i.e. different views) visible to them. Below,
we review Px86view, a view-based operational semantics for x86 persistency
(§2.1); we then describe PIEROGI (§2.2) using a series of running examples. 


2.1 Px86view at a Glance


In the literature of concurrency semantics, consistency models describe the per- 
mitted behaviours of programs by constraining the volatile memory order, i.e. 
the order in which memory writes are made visible to other threads, while per- 
sistency models describe the permitted behaviours of programs upon recovering 
from a crash (e.g. a power failure) by defining the persistent memory order, i.e. 
the order in which writes are committed to persistent memory. To distinguish 
between the two, memory stores are differentiated from memory persists: the 
former denotes the process of making a write visible to other threads, whilst the 
latter denotes the process of committing writes to persistent memory (durably). 
Px86view Consistency The consistency semantics of Px86view is that of the
well-known TSO (total store ordering) [36] model, where later (in program or- 
der) reads can be reordered before earlier writes on different locations. This is 
illustrated in the store buffering (SB) example below (left): 


store x 1;   ||  store y 1;             store x 42;  ||  a := load y;
a := load y  ||  b := load x   (SB)     store y 7    ||  b := load x   (MP)
a=0 ∧ b=0 : ✓                           a=7 ∧ b=0 : ✗


Specifically, assuming x=y=0 initially, since a := load y (resp. b := load x) can
be reordered before store x 1 (resp. store y 1), it is possible to observe the weak
behaviour a=0 ∧ b=0. A well-known way of modelling such reorderings in TSO
is through store buffers: when a thread τ executes a write store x v, its effects
are not immediately made visible to other threads; rather they are delayed in a
thread-local (store) buffer only visible to τ, and propagated to the memory at
a later time, whereby they become visible to other threads. For instance, when
store x 1 and store y 1 are delayed in the respective thread buffers (and thus
not visible to one another), then a := load y and b := load x may both read 0.
Cho et al. [9] capture this by associating each thread τ with a coherence view
(also called a thread-observable view), describing the writes observable by τ.
Distinct threads may have different coherence views. For instance, after executing 
store x 1 and store y 1, the coherence view of the left thread may include 
store x 1 and not store y 1, while that of the right may include store y 1 and
not store x 1. This way, a := load y (resp. b := load x) may read the initial value
0, as its coherence view does not include store y 1 (resp. store x 1).

After SC (sequential consistency) [27], TSO is one of the strongest consistency 
models and supports synchronisation patterns such as message passing, as shown 
in MP above, where a=7 ∧ b=0 cannot be observed. Specifically, (assuming
x=y=0 initially) if the right thread reads 7 from y (written by the left thread),
then the left thread passes a message to the right. Under TSO, message passing
ensures that the instruction writing the message and all those ordered before it
(e.g. store x 42; store y 7) are executed (ordered) before the instruction reading
it (e.g. a := load y). As such, since b := load x is executed after a := load y, if
a=7 (i.e. store x 42 is executed before a := load y), then b=42.
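
The store-buffer reading of TSO described above can be explored mechanically. The following Python sketch is an illustration only and is not taken from Px86view or PIEROGI: the function name tso_outcomes, the tuple encoding of instructions and the explicit per-thread FIFO buffers are all assumptions made for this example. It enumerates every interleaving of thread steps and buffer propagations for the SB and MP programs, assuming all locations and registers are 0 initially.

```python
def tso_outcomes(threads, registers):
    """Return every final register valuation reachable under a store-buffer model.

    threads   -- one instruction list per thread; instructions are
                 ("store", loc, val) or ("load", reg, loc)
    registers -- register names, fixing the order of each result tuple
    All locations and registers are assumed to be 0 initially.
    """
    n = len(threads)
    init = ((0,) * n, ((),) * n, (), ())  # pcs, FIFO buffers, memory, registers
    seen, stack, finals = {init}, [init], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        successors = []
        for t, prog in enumerate(threads):
            if pcs[t] < len(prog):  # thread t executes its next instruction
                op = prog[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op[0] == "store":  # delay the write in t's own buffer
                    _, loc, val = op
                    new_buf = bufs[t] + ((loc, val),)
                    new_bufs = bufs[:t] + (new_buf,) + bufs[t + 1:]
                    successors.append((new_pcs, new_bufs, mem, regs))
                else:  # load: newest buffered write of t, else shared memory
                    _, reg, loc = op
                    buffered = [v for l, v in bufs[t] if l == loc]
                    val = buffered[-1] if buffered else dict(mem).get(loc, 0)
                    new_regs = tuple(sorted({**dict(regs), reg: val}.items()))
                    successors.append((new_pcs, bufs, mem, new_regs))
            if bufs[t]:  # propagate t's oldest buffered write to memory
                loc, val = bufs[t][0]
                new_mem = tuple(sorted({**dict(mem), loc: val}.items()))
                new_bufs = bufs[:t] + (bufs[t][1:],) + bufs[t + 1:]
                successors.append((pcs, new_bufs, new_mem, regs))
        if not successors:  # all threads finished and all buffers drained
            final = dict(regs)
            finals.add(tuple(final.get(r, 0) for r in registers))
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return finals


SB = [[("store", "x", 1), ("load", "a", "y")],
      [("store", "y", 1), ("load", "b", "x")]]
MP = [[("store", "x", 42), ("store", "y", 7)],
      [("load", "a", "y"), ("load", "b", "x")]]
```

Under these assumptions, tso_outcomes(SB, ("a", "b")) includes the weak outcome (0, 0), whereas tso_outcomes(MP, ("a", "b")) never contains (7, 0), because each buffer drains in FIFO order.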


Px86view Persistency Cho et al. [9] recently developed the Px86view model,
a view-based description of the Intel-x86 persistency semantics, which follows 
a buffered, relaxed persistency model. Under a buffered model, memory persists 
occur asynchronously [10]: they are buffered in a queue to be committed to persis- 
tent memory at a future time. This way, persists occur after their corresponding 
stores and as prescribed by the persistency semantics, while allowing the execu- 
tion to proceed ahead of persists. As such, after recovering from a crash, only 
a prefix of the persistent memory order may have persisted. (The alternative is 
unbuffered persistency in which stores and persists happen simultaneously.) 

Under relaxed persistency, the volatile and persistent memory orders may 
disagree: the order in which the writes are made visible to other threads may 
differ from the order in which they are persisted. (The alternative is strict per- 
sistency in which the volatile and persistent memory orders coincide.) 

The relaxed and buffered persistency of Px86view is shown in Fig. 1a. If a
crash occurs during (or after) the execution of Fig. 1a, at crash time either write
may have persisted and thus x, y ∈ {0, 1} upon recovery. Note that the two writes
cannot be reordered under Intel-x86 (TSO) consistency and thus at no point
during the normal (non-crashing) execution of Fig. 1a is x=0, y=1 observable.
Nevertheless, in case of a crash it is possible to observe x=0, y=1 after recovery.
That is, due to the relaxed persistency of Px86view, the store order (x before y)
is separate from the persist order (y before x). More concretely, under Px86view
the writes may persist 1) in any order, when they are on distinct locations; or
2) in the volatile memory order, when they are on the same location.

To afford more control over when pending writes are persisted, Intel-x86
provides explicit persist instructions such as flush x and flushopt x that can be
used to persist the pending writes on x.⁵ This is illustrated in Fig. 1b: executing
flush x persists the earlier write on x (i.e. store x 1) to memory. As such, if


* Given a cache line (a set of locations), writes on distinct cache lines may persist in 
any order, while writes on the same cache line persist in the volatile memory order. 
For brevity, we assume that each cache line contains a single location, thus forgoing 
the need for cache lines. However, it is straightforward to lift this assumption. 

5 Executing flush x or flushopt x persists the pending writes on all locations in the
cache line of x. However, as discussed, we assume cache lines contain single locations.
(a)             (b)             (c)             (d)             (e)
store x 1;      store x 1;      store x 1;      store x 1;      store x 1;  ||  a := load y;
store y 1       flush x;        flushopt x;     flushopt x;     flush x;    ||  if (a=1)
                store y 1       store y 1       sfence;         store y 1   ||    store z 1
                                                store y 1
↯: x,y ∈ {0,1}  ↯: y=1 ⇒ x=1    ↯: x,y ∈ {0,1}  ↯: y=1 ⇒ x=1    ↯: z=1 ⇒ x=1


Fig. 1: Example Px86view programs and possible values after recovery from a
crash (↯). In all examples x, y, z are distinct locations in persistent memory
such that x=y=z=0 initially, and a is a (thread-local) register.


the execution of Fig. 1b crashes and upon recovery y=1, then x=1. That is, if
store y 1 has executed and persisted before the crash, then so must the earlier
store x 1; flush x. Note that y=1 ⇒ x=1 describes a crash invariant, in that it
holds upon crash recovery regardless of when (i.e. at which program point) the
crash may have occurred. Observe that this crash invariant is guaranteed thanks
to the ordering constraints on flush instructions. Specifically, flush instructions
are ordered with respect to all writes; as such, flush x in Fig. 1b cannot be
reordered with respect to either write, and thus upon recovery y=1 ⇒ x=1.


However, instruction reordering means that persist instructions may not exe- 
cute at the intended program point and thus not guarantee the intended persist 
ordering. Specifically, flushopt x is only ordered with respect to earlier writes on 
x, and may be reordered with respect to later writes, as well as earlier writes on 
different locations. This is illustrated in Fig. 1c: flushopt x is not ordered with 
respect to store y 1 and may be reordered after it. Therefore, if a crash occurs 
after store y 1 has executed and persisted but before flushopt x has executed,
then it is possible to observe y=1, x=0 on recovery. That is, there is no guarantee 
that store x 1 persists before store y 1, despite the intervening flushopt x. 


In order to prevent such reorderings and to strengthen the ordering constraints
between flushopt and later instructions, one can use either fence instructions,
namely sfence (store fence) and mfence (memory fence), or atomic
read-modify-write (RMW) instructions such as compare-and-set (CAS) and
fetch-and-add (FAA). More concretely, sfence, mfence and RMW instructions are
ordered with respect to all (both earlier and later) flushopt, flush and write
instructions, and can be used to prevent reorderings such as that in Fig. 1c. This
is illustrated in Fig. 1d. Unlike in Fig. 1c, the intervening sfence ensures that
flushopt in Fig. 1d is ordered with respect to store y 1 and cannot be reordered
after it, ensuring that store x 1 persists before store y 1 (i.e. y=1 ⇒ x=1 upon
recovery), as in Fig. 1b. Note that replacing sfence in Fig. 1d with mfence or an
RMW yields the same result. Alternatively, one can think of flushopt x executing
asynchronously, in that its effect (persisting x) does not take place immediately
upon execution, but rather at a later time. However, upon executing a barrier
instruction (i.e. mfence, sfence or an RMW), execution is blocked until the
effects of earlier flushopt instructions take place; that is, executing such barrier
instructions ensures that earlier flushopt instructions behave synchronously (like flush).
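
The recovery outcomes of the single-threaded programs in Fig. 1a-1d can be reproduced with a small enumeration. The Python sketch below is a deliberately simplified model and not the Px86view semantics itself; the name recoverable_states and the instruction encoding are assumptions for this example. It assumes that a crash may occur after any prefix of the program, that each location recovers to one of its writes no older than the newest write known to be persisted, that flush persists the latest write immediately, and that the effect of flushopt only takes place at a later sfence.

```python
from itertools import product


def recoverable_states(program, locations=("x", "y")):
    """Enumerate post-crash values (one tuple per state, ordered as `locations`)."""
    states = set()
    for crash_point in range(len(program) + 1):
        history = {loc: [0] for loc in locations}  # values written so far
        persisted = {loc: 0 for loc in locations}  # oldest write that may be recovered
        pending = {loc: 0 for loc in locations}    # flushopt effects awaiting a barrier
        for op, *args in program[:crash_point]:
            if op == "store":
                loc, val = args
                history[loc].append(val)
            elif op == "flush":      # synchronous: latest write is persisted now
                (loc,) = args
                persisted[loc] = len(history[loc]) - 1
            elif op == "flushopt":   # asynchronous: remembered until a barrier
                (loc,) = args
                pending[loc] = len(history[loc]) - 1
            elif op == "sfence":     # barrier: pending flushopt effects take place
                for loc in locations:
                    persisted[loc] = max(persisted[loc], pending[loc])
        choices = [history[loc][persisted[loc]:] for loc in locations]
        states.update(product(*choices))
    return states


FIG_1A = [("store", "x", 1), ("store", "y", 1)]
FIG_1B = [("store", "x", 1), ("flush", "x"), ("store", "y", 1)]
FIG_1C = [("store", "x", 1), ("flushopt", "x"), ("store", "y", 1)]
FIG_1D = [("store", "x", 1), ("flushopt", "x"), ("sfence",), ("store", "y", 1)]
```

In this model the state x=0, y=1 is recoverable for FIG_1A and FIG_1C but not for FIG_1B or FIG_1D, matching the crash invariant y=1 ⇒ x=1 discussed above.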
P : {a = b = 0 ∧ ∀τ ∈ {1,2}. [x]_τ = [y]_τ = {0}}

P1 : {7 ∉ [y]_2 ∧ a = 0}             ||  Q1 : {[y]_2 ⊆ {0,7} ∧ (7 ∈ [y]_2 ⇒ ⟨y,7⟩[x]_2 = {42})}
store x 42;   // SP1, Cons           ||  a := load y;   // LP2
P2 : {[x]_1 = {42} ∧ 7 ∉ [y]_2}      ||  Q2 : {a ∈ {0,7} ∧ (a = 7 ⇒ [x]_2 = {42})}
store y 7;    // SP1, Cons           ||  b := load x;   // LP1, Cons
P3 : {true}                          ||  Q3 : {a = 7 ⇒ b = 42}

Q : {a = 7 ⇒ b = 42}


Fig. 2: A PIEROGI proof sketch of message passing (MP), where the // annotation
at each step identifies the PIEROGI proof rule (in §3.4) applied, and the
highlighted assertions capture the effects of the preceding instruction.


The example in Fig. 1e illustrates how message passing can impose persist
orderings on the writes of different threads. (Note that the program in the left
thread of Fig. 1e is that of Fig. 1b.) As in MP, if a = 1, then store x 1; flush x
is executed before a := load y (thanks to message passing). Consequently, since
store z 1 is executed after a := load y when a = 1, we know store x 1; flush x
is executed before store z 1. Therefore, if upon recovery z=1 (i.e. store z 1 has
persisted before the crash), then x=1 (store x 1; flush x must have also persisted
before the crash). As before, replacing flush x in Fig. 1e with flushopt x; C
yields the same result upon recovery when C is an sfence/mfence or an RMW.


2.2 PIEROGI: View-Based Owicki-Gries Reasoning for Px86view


Sequential Reasoning about Consistency using Views In Fig. 2 we present
a PIEROGI proof sketch of MP. Recall that in order to account for possible write-read
reorderings on Intel-x86 architectures, Px86view associates each thread τ
with a coherence view, describing the writes visible to τ. To reason about such
thread-observable views, PIEROGI supports assertions of the form [x]_τ = S,
stating that τ may read any value in the set S for location x. That is, the
coherence view of τ for x consists of the writes whose values are those in S.

In the remainder of this article we enumerate the threads in our examples 
from left to right; e.g. the left and right threads in Fig. 2 are identified as 1 
and 2, respectively. Moreover, we assume the registers of distinct threads have 
distinct names. The precondition P in Fig. 2 thus states that both threads may
initially only read 0 for both x and y: ∀τ ∈ {1, 2}. [x]_τ = [y]_τ = {0}.

In the case of thread 1, we can weaken P (using the standard rule of consequence
of Hoare logic, see Cons in §3) to obtain P1. Upon executing store x 42
(1) we weaken the resulting assertion by dropping the a = 0 conjunct; and
(2) we update the observable view of thread 1 on x to reflect the new value of
x: [x]_1 = {42}; that is, after executing store x 42, the only value observable
by thread 1 for x is 42. Similarly, after executing store y 7, we could assert
[y]_1 = {7}; however, this is not necessary for establishing the final postcondition
Q, and we thus simply weaken the postcondition to true (P3).
(left: Fig. 1b)

{[y]^P = {0}}
store x 1;    // SP1
{[x]_1 = {1} ∧ [y]^P = {0}}
flush x;      // FP1
{[x]_1 = {1} ∧ [x]^P = {1} ∧ [y]^P = {0}}
store y 1;    // SP1
{[x]_1 = {1} ∧ [x]^P = {1} ∧ [y]_1 = {1}}
{↯ : [y]^P = {1} ⇒ [x]^P = {1}}

(right: Fig. 1d)

{[y]^P = {0}}
store x 1;    // SP1
{[x]_1 = {1} ∧ [y]^P = {0}}
flushopt x;   // OP1
{[x]_1 = {1} ∧ [x]^A_1 = {1} ∧ [y]^P = {0}}
sfence;       // SFP
{[x]_1 = {1} ∧ [x]^P = {1} ∧ [y]^P = {0}}
store y 1;    // SP1
{[x]_1 = {1} ∧ [x]^P = {1} ∧ [y]_1 = {1}}
{↯ : [y]^P = {1} ⇒ [x]^P = {1}}


Fig. 3: Proof sketches of Fig. 1b (left) and Fig. 1d (right)



Analogously, in the case of thread 2 we weaken P to obtain Q1: [y]_2 = {0}
implies [y]_2 ⊆ {0,7} and 7 ∈ [y]_2 ⇒ ⟨y,7⟩[x]_2 = {42}. Note that 7 ∈ [y]_2 ⇒
⟨y,7⟩[x]_2 = {42} yields a vacuously true implication as [y]_2 = {0} and thus
7 ∉ [y]_2. The ⟨y,7⟩[x]_2 denotes a conditional view assertion [11] that describes
how reading a value on one location (y) affects the thread-observable view on a
different location (x). More concretely, ⟨y,7⟩[x]_2 = {42} states that if thread 2
executes a load on y and reads value 7, it subsequently may only observe value
42 for x. This is indeed the essence of message passing in MP: once thread 2
reads 7 from y, it may only read 42 for x thereafter. As such, after executing
the read instruction a := load y (1) we apply the LP2 rule (in Fig. 7) which
simply replaces [y]_2 with the local register a in which the value of y is read; and
(2) we replace the conditional assertion ⟨y,7⟩[x]_2 = {42} with the implication
a = 7 ⇒ [x]_2 = {42}, stating that if the value read by thread 2 for y (in a) is
7, then its observable view for x is {42}. Similarly, upon executing b := load x
we simply apply LP1 to replace [x]_2 with the local register b in which the value
of x is read. Lastly, the final postcondition Q is given by the conjunction of the
thread-local postconditions (P3 ∧ Q3).


Concurrent Reasoning and Stability In our description of the PIEROGI
proof sketch in Fig. 2 thus far we focused on sequential (per-thread) reasoning,
ignoring how concurrent threads may affect the validity of assertions at each
program point. Specifically, as in existing concurrent logics [11, 26, 30, 31], we
must ensure that the assertions at each program point are stable under concurrent
operations. For instance, to ensure that P1 remains stable under the
concurrent operation a := load y, we require that executing a := load y on states
satisfying the conjunction of P1 and the precondition of a := load y (i.e. Q1)
not invalidate P1, in that the resulting states continue to satisfy P1; that is,
{P1 ∧ Q1} a := load y {P1} holds. Similarly, we must ensure that P1 is stable
under b := load x, i.e. {P1 ∧ Q2} b := load x {P1} holds. Analogously, we must
establish the stability of P2, P3, Q1, Q2 and Q3 under concurrent operations. In
§3 we present syntactic rules that simplify the task of checking stability obligations.
It is then straightforward to show that the assertions in Fig. 2 are stable.
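
To make the shape of the stability obligations concrete, the following Python sketch lists the triples that must be checked for a two-thread proof outline. It only generates the obligations as text and does not discharge them; the function name stability_obligations and the dictionary encoding of outlines are assumptions for this example, and the assertion names stand for the annotations of Fig. 2.

```python
def stability_obligations(outlines):
    """List the triples {A ∧ pre} stmt {A} required by Owicki-Gries stability.

    outlines maps a thread id to (assertions, statements), where assertions[i]
    is the precondition of statements[i] and the last assertion is the
    thread's postcondition.
    """
    obligations = []
    for tid, (assertions, _) in outlines.items():
        for other, (other_assertions, other_statements) in outlines.items():
            if other == tid:
                continue  # a thread cannot interfere with itself
            # zip stops at the last statement, so postconditions are not used as 'pre'
            for pre, stmt in zip(other_assertions, other_statements):
                for assertion in assertions:
                    obligations.append((f"{assertion} ∧ {pre}", stmt, assertion))
    return obligations


FIG_2 = {
    1: (["P1", "P2", "P3"], ["store x 42", "store y 7"]),
    2: (["Q1", "Q2", "Q3"], ["a := load y", "b := load x"]),
}
```

For FIG_2 this yields twelve obligations, among them ("P1 ∧ Q1", "a := load y", "P1") and ("P1 ∧ Q2", "b := load x", "P1"), which are the two triples spelled out above.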
Reasoning about flush Persistency To reason about the relaxed, buffered
persistency of Px86view, Cho et al. [9] introduce persistency views, determining
the possible persisted values for each location; i.e. the values of those writes that 
may have persisted to memory. Note that the persistency view determines the 
possible values observable upon recovery from a crash. By contrast, the (per- 
thread) coherence views determine the observable values during normal (non- 
crashing) executions, and have no bearing on the post-crash values. 

Analogously, we extend PIEROGI with assertions of the form [x]^P = S, stating
that the persistent view for x includes writes whose values are given by S. To
see this, consider the PIEROGI proof sketch of Fig. 1b in Fig. 3 (left). Initially,
y holds 0 in persistent memory: [y]^P = {0}. (Note that the precondition could
additionally include [x]_1 = [y]_1 = {0} ∧ [x]^P = {0} to denote that initially the
thread may only observe 0 for x and y and that x holds 0 in persistent memory;
however, this is not needed for the proof and we thus forgo it.)

As before, after executing store x 1, the observable value for x is updated, as
denoted by [x]_1 = {1}. Moreover, after executing flush x, the persisted value for
x is 1, as denoted by [x]^P = {1}, by committing (persisting) the observable value
for x (i.e. [x]_1 = {1}) to memory (see FP1 in Fig. 7). Finally, after executing
store y 1, the observable value for y is updated, as denoted by [y]_1 = {1}.


Crash Invariants Recall that ↯: y=1 ⇒ x=1 in Fig. 1b denotes a crash invariant
in that it describes the persistent memory upon recovery from a crash at
any program point. This is because we have no control over when a crash may
occur. To capture such invariants, in PIEROGI we write quadruples of the form
{P} C {Q} {↯ : I}, where {P} C {Q} denotes a Hoare triple and I denotes
the crash invariant. If C is a sequential program, I must follow from every assertion
(including P and Q) in the proof. For instance, in the proof outline of
Fig. 3 (left) all four assertions imply the invariant [y]^P = {1} ⇒ [x]^P = {1}. We
discuss the meaning of crash invariants for concurrent programs below.
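
The requirement that the crash invariant follows from every assertion of a sequential proof outline can be checked by brute force on small examples. The Python sketch below is an illustration under stated assumptions: views are modelled as non-empty subsets of {0, 1}, a state is the tuple ([x]_1, [y]_1, [x]^P, [y]^P), and the four assertions are a transcription of the outline for Fig. 1b described in the text. The names implies_everywhere, OUTLINE and crash_invariant are hypothetical.

```python
from itertools import product

ZERO, ONE = frozenset({0}), frozenset({1})
VIEWS = [ZERO, ONE, frozenset({0, 1})]  # possible value sets of a single view


def implies_everywhere(assertion, invariant):
    """Check assertion => invariant on every state (x1, y1, xP, yP)."""
    return all(invariant(*state)
               for state in product(VIEWS, repeat=4)
               if assertion(*state))


# Assertions of the proof outline for Fig. 1b, in program order.
OUTLINE = [
    lambda x1, y1, xP, yP: yP == ZERO,
    lambda x1, y1, xP, yP: x1 == ONE and yP == ZERO,
    lambda x1, y1, xP, yP: x1 == ONE and xP == ONE and yP == ZERO,
    lambda x1, y1, xP, yP: x1 == ONE and xP == ONE and y1 == ONE,
]


def crash_invariant(x1, y1, xP, yP):
    """[y]^P = {1} implies [x]^P = {1}."""
    return yP != ONE or xP == ONE
```

Each assertion in OUTLINE passes implies_everywhere against crash_invariant, whereas the trivial assertion true does not, since it admits a state with [y]^P = {1} and [x]^P = {0}.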
Reasoning about flushopt Persistency Recall that unlike flush, flushopt
instructions (due to instruction reordering) may behave asynchronously and
their effects may not take place immediately after execution. As such, unlike
for flush x, after executing flushopt x we cannot simply copy the observable
view on x to the persistent view on x.

To capture the asynchronous nature of flushopt, Cho et al. [9] introduce
yet another set of views, namely the thread-local asynchronous view: the asynchronous
view of thread τ on x describes the values (writes) that will be persisted
at a later time (asynchronously) by τ upon executing a barrier instruction. That
is, 1) when thread τ executes flushopt x, its asynchronous view of x is advanced
to at least its observable view of x; and 2) when τ executes a barrier (sfence,
mfence or RMW), then its persistent view for each location is advanced to at
least its corresponding asynchronous view. We model this in PIEROGI by 1) setting
[x]^A_τ to be a subset of [x]_τ when flushopt x is executed; and 2) setting [x]^P
to be a subset of [x]^A_τ (for each location x) when a barrier is executed.

This is illustrated in the proof sketch of Fig. 1d in Fig. 3 (right). In particular, 
unlike the proof sketch of Fig. 1b in Fig. 3 (left), after executing flushopt x we
P : {a = 0 ∧ ∀o ∈ {x,y,z}, τ ∈ {1,2}. [o]_τ = [o]^P = {0}}

P1 : {[y]_2 = {0} ∧ [z]^P = {0} ∧ a = 0}                 ||  {true}
store x 1;   // SP1                                       ||  a := load y;
P2 : {[y]_2 = {0} ∧ [z]^P = {0} ∧ a = 0 ∧ [x]_1 = {1}}   ||  {true}
flush x;     // FP1, Cons                                 ||  if (a = 1)
P3 : {...}                                                ||    {a = 1}
store y 1;   // SP1, Cons                                 ||    store z 1;
P4 : {[x]^P = {1}}                                        ||  {true}

Q : {[x]^P = {1}}

I : {↯ : [z]^P = {1} ⇒ [x]^P = {1}}


Fig. 4: A PIEROGI proof sketch of Fig. 1e
cannot simply copy the thread-observable view to the persistent view. Rather,
we copy the thread-observable view [x]_1 to its asynchronous view and assert
[x]^A_1 = {1}; and upon executing the subsequent sfence, we copy the
thread-asynchronous view to the persistent view and assert [x]^P = {1}.
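
The view bookkeeping behind these proof sketches can be mimicked by a tiny forward pass over a single-threaded program. The Python sketch below is a simplification and not the PIEROGI rule set: it tracks only exactly known views, keyed as ("coh", x), ("async", x) and ("pers", x), and conservatively forgets the asynchronous and persistent facts about a location when that location is written. The function name step and this encoding are assumptions for the example.

```python
def step(facts, instr):
    """Return the view facts known after one instruction (single thread only)."""
    op, *args = instr
    new = dict(facts)
    if op == "store":
        loc, val = args
        new[("coh", loc)] = frozenset({val})  # only the new value is observable
        new.pop(("async", loc), None)         # older facts about loc are dropped
        new.pop(("pers", loc), None)
    elif op == "flush":                       # copy observable view to persistent view
        (loc,) = args
        if ("coh", loc) in new:
            new[("pers", loc)] = new[("coh", loc)]
    elif op == "flushopt":                    # copy observable view to asynchronous view
        (loc,) = args
        if ("coh", loc) in new:
            new[("async", loc)] = new[("coh", loc)]
    elif op == "sfence":                      # copy asynchronous views to persistent views
        for (kind, loc), values in facts.items():
            if kind == "async":
                new[("pers", loc)] = values
    return new


def run(program, facts):
    """Apply step to each instruction in order and return the final facts."""
    for instr in program:
        facts = step(facts, instr)
    return facts


WITH_SFENCE = [("store", "x", 1), ("flushopt", "x"), ("sfence",), ("store", "y", 1)]
WITHOUT_SFENCE = [("store", "x", 1), ("flushopt", "x"), ("store", "y", 1)]
```

Starting from the single fact that y persistently holds 0, running WITH_SFENCE ends with the persistent view of x known to be {1}, while WITHOUT_SFENCE ends with no fact about the persistent view of x.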


Putting It All Together We next present a PIEROGI proof sketch of Fig. 1e
in Fig. 4. The proof of the left thread is analogous to that in Fig. 3 (left); 
the proof of the right thread is straightforward and applies standard reasoning 
principles. The final postcondition Q is obtained by weakening the conjunction 
of per-thread postconditions. 

Note that the crash invariant I follows from the assertions at each program
point of thread 1 (i.e. P1 ∨ P2 ∨ P3 ∨ P4 ⇒ I). That is, the crash invariant must
follow from the assertions at all program points of some thread (e.g. thread 1 
in Fig. 4). In the case of sequential programs (e.g. in Fig. 3), this amounts to all 
program points (of the only executing thread). Intuitively, we must ensure that 
the crash invariant holds at every program point regardless of how the underlying 
state changes. As the assertions are stable under concurrent operations, it is 
thus sufficient to ensure that there exists some thread whose assertions at each 
program point imply the crash invariant. 


3 The PIEROGI Proof Rules and Reasoning Principles


We proceed with a description of our verification framework. As with prior 
work [11], the view-based semantics for persistent TSO [9] allows us to use the 
standard Owicki-Gries rules [2,30] for compound statements. The main ad- 
justment is the introduction of a new specialised assertion language capable of 
expressing properties about the different “views” described intuitively in §2. As 
such, since view updates are highly non-deterministic, the standard “assignment 
axiom” of Hoare Logic (and by extension Owicki-Gries) is no longer applicable.
Moreover, unlike SC, reads in a weak memory setting have a side-effect: their 
interaction with the memory location being read causes the view of the executing 
v, u ∈ VAL ≜ ℕ    x, y, ... ∈ LOC    a, b, ... ∈ REG    τ ∈ TID ≜ ℕ    i, j, k, ... ∈ LAB

â, b̂, ... ∈ AUXVAR    ê ∈ AUXEXP ::= v | â | ê + ê | ···
e ∈ EXP ::= v | a | e + e | ···    B ∈ BEXP ::= true | B ∧ B | ···
α ∈ AST ::= skip | a := e | a := load x | store x e
          | a := CAS x e e | sfence | mfence | flush x | flushopt x
ls ∈ LST ::= α goto j | if B goto j else to k | ⟨α goto j, â := ê⟩
Π ∈ PROG ≜ TID × LAB → LST    pc ∈ PC ≜ TID → LAB


Fig. 5: The PIEROGI domains and programming language 


thread to advance. Therefore, we resort to a set of proof rules that describe how 
views are modified and manipulated, as formalised by our view-based assertions. 


3.1 The PIEROGI Programming Language 


We present the programming language in Fig. 5. Atomic statements (in AST) 
comprise skip, assignment, memory reads and writes, barrier instructions and 
explicit persists. Specifically, a :=e evaluates expression e and returns it in 
(thread-local) register a; a :=load x reads from memory location x and returns 
it in register a; and store x e writes the evaluated value of e into location x. The 
a := CAS x e1 e2 denotes ‘compare-and-set’ on location x, from the evaluated
value of e1 to the evaluated value of e2, and sets a to 1 if the CAS succeeds and
to 0 otherwise. Finally, mfence denotes a memory fence, sfence denotes a store
fence, and flush x and flushopt x denote explicit persist instructions (see §2).

Formally, we model a program Π as a function mapping each pair (τ,i) of
thread identifier and label to the labelled statement (in LST) to be executed. A
labelled statement may be 1) a plain statement of the form α goto j, comprising
an atomic statement α to be executed and the label j of the next statement;
2) a conditional statement of the form if B goto j else to k to accommodate
branching, which proceeds to label j if B holds and to k otherwise; or 3) a statement
with an auxiliary update ⟨α goto j, â := ê⟩, which behaves as α goto j,
but in addition (in the same atomic step) updates the value of the auxiliary
variable â with the auxiliary expression ê. It is well known that Owicki-Gries
proofs require auxiliary variables to record the history of executions to differentiate
states that would otherwise not be distinguishable [30]. We show how
auxiliary variables are used in PIEROGI in the flush buffering example (§4).

We track the control flow within each thread via the program counter function,
pc, recording the program counter of each thread. We assume a designated
label, ι ∈ LAB, representing the initial label; i.e. each thread begins execution
with pc(τ) = ι. Similarly, ζ ∈ LAB represents the final label. Moreover,
if pc(τ) = i at the current execution step, then: 1) when Π(τ,i) = α goto j
or Π(τ,i) = ⟨α goto j, â := ê⟩, then pc(τ) = j at the next step; 2) when
Π(τ,i) = if B goto j else to k at the current step, then if B holds in the current
state, then pc(τ) = j at the next step; otherwise pc(τ) = k at the next step.
Example 1. The program in Fig. 4, assuming that the left thread has id 1, is
given as follows. The formalisation of the right thread is omitted, but is similar.
Π ≜ (1,ι) ↦ store x 1 goto 2, (1,2) ↦ flush x goto 3,
(1,3) ↦ store y 1 goto ζ, ...
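
The encoding of programs as maps from (thread, label) pairs to labelled statements can be sketched directly in code. The Python fragment below is a hypothetical rendering: the class names Goto and Branch, the strings used for the initial and final labels, and in particular the entries for thread 2 (whose formalisation is omitted above) are assumptions for this example. Register values are supplied by the caller rather than computed by executing loads.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

INITIAL, FINAL = "iota", "zeta"  # stand-ins for the initial and final labels


@dataclass(frozen=True)
class Goto:
    """Plain labelled statement: run `stmt`, then continue at `target`."""
    stmt: Tuple
    target: object


@dataclass(frozen=True)
class Branch:
    """Conditional labelled statement: if cond goto then_label else to else_label."""
    cond: Callable[[Dict[str, int]], bool]
    then_label: object
    else_label: object


def next_label(program, tid, label, regs):
    """Program counter of thread `tid` after executing the statement at `label`."""
    statement = program[(tid, label)]
    if isinstance(statement, Goto):
        return statement.target
    return statement.then_label if statement.cond(regs) else statement.else_label


def labels_visited(program, tid, regs):
    """Follow the control flow of one thread for fixed register values."""
    label, visited = INITIAL, [INITIAL]
    while label != FINAL:
        label = next_label(program, tid, label, regs)
        visited.append(label)
    return visited


PROGRAM = {
    (1, INITIAL): Goto(("store", "x", 1), 2),
    (1, 2): Goto(("flush", "x"), 3),
    (1, 3): Goto(("store", "y", 1), FINAL),
    # Thread 2: a guess at the omitted formalisation, not taken from the source.
    (2, INITIAL): Goto(("load", "a", "y"), 2),
    (2, 2): Branch(lambda regs: regs["a"] == 1, 3, FINAL),
    (2, 3): Goto(("store", "z", 1), FINAL),
}
```

With this encoding, thread 1 visits the labels iota, 2, 3, zeta in order, and thread 2 reaches label 3 (the store to z) only when register a holds 1.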


3.2 View-Based Expressions 


As with prior work on the RC11 model [21], we interpret PIEROGI expressions 
directly over a view-based state. We use expressions tailored for the view-based 
Px86view model [9], which allow us to express relationships between different
system components, including the persistent memory. 

Our expressions fall into one of four categories: 1) current view expressions, 
which describe the current views of different system components (e.g. the per- 
sistent view); 2) conditional view expressions [11], which describe a view on a 
location after reading a particular value on a different location; 3) last view ex- 
pressions, which hold if a component is viewing the last write to a location; and 
4) write-count expressions, which describe the number of writes to a location. 

Our current view expressions comprise [x]_τ, [x]^P and [x]^A_τ, as described below;
as shown in §2, each of these expressions describes a set of possible values.


[x]_τ denotes the coherence view of thread τ: the set of values τ may read for x.

[x]^P denotes the persistent memory view: the set of values that x may hold in
(persistent) memory.

[x]^A_τ denotes the asynchronous memory view of thread τ: the set of values that
can be persisted after a barrier instruction (sfence/mfence/RMW) is executed
by τ (see rule OP in Fig. 7). Asynchronous views are updated after
executing a flushopt; however, unlike persistent memory views, the values
in asynchronous views are not guaranteed to be persisted until a subsequent
barrier is executed by the same thread.


Conditional view expressions are of the form ⟨x, v⟩[y]_τ, as described below.
As discussed in §2, conditional expressions capture the crux of message passing.


⟨x, v⟩[y]_τ returns a set of values that τ may read for y after it reads value v
for x. In particular, if ⟨x, v⟩[y]_τ = S holds for some set S and τ executes
a := load x, then in the state immediately after the load, if a = v, then
[y]_τ ⊆ S (see LP2 in Fig. 7).


Last-view expressions (cf. [16]) are boolean-valued and hold if a particular
component is synchronised (i.e. observes the latest value) on the given location.
Such expressions provide determinism guarantees on load and flush. For instance,
if the view of $\tau$ is the last write on $x$, then a read from $x$ by $\tau$ will load
this last value. Last-view expressions comprise $[\![x]\!]_\tau$ and $[\![x]\!]^{\mathsf{F}}_\tau$:


$[\![x]\!]_\tau$ holds iff $\tau$ is currently viewing the last write to $x$. Thus, for example, if
$[\![x]\!]_\tau$ holds, then a load from $x$ by $\tau$ reads the last write to $x$. Note that
unlike architectural operational models [36], in the view model [9], writes are
visible to all threads as soon as they occur.

$[\![x]\!]^{\mathsf{F}}_\tau$ holds iff a flush of $x$ by $\tau$ is guaranteed to flush the last write to $x$ to
persistent memory.


Lastly, write-count expressions are of the form $|x, v|$, as described below. Such
assertions are useful for inferring view expressions from known facts about the
number of writes in the system with a particular value (see Fig. 11).


$|x, v|$ returns the number of writes to $x$ with value $v$. If $|x, v| = n$ holds and $\tau$ writes
to $y \neq x$, or writes a value $u \neq v$, then $|x, v| = n$ continues to hold afterwards.


3.3 Owicki–Gries Reasoning


We present the PIEROGI proof system, as an extension of Hoare Logic with
Owicki–Gries reasoning to account for concurrency. The main differences are that
1) our program annotations contain view-based assertions that allow reasoning 
about weak and persistent memory behaviours; and 2) we define a crash invariant 
to describe the recoverable state of the program after a crash. We proceed by first 
defining proof outlines, then providing syntactic rules for proving their validity. 
Our proof rules are syntactic, and thus can be understood and used without 
having to understand the details of the underlying Px86view model. 

We let $\mathrm{ASSERTION}_{\mathsf{Px}}$ be the set of assertions (i.e. predicates over Px86view
states) that use view-based expressions (§3.2). A crash invariant,
$I \in \mathrm{INV} \subseteq \mathrm{ASSERTION}_{\mathsf{Px}}$, is defined over persistent views only, i.e. it only comprises the
persistent view expressions of the form $[x]^{\mathsf{P}}$. We model program annotations via
an annotation function, $ann \in \mathrm{ANN} = \mathrm{TID} \times \mathrm{LAB} \to \mathrm{ASSERTION}_{\mathsf{Px}}$, associating
each program point $(\tau, i)$ with its associated assertion. A proof outline is a tuple
$(in, ann, I, fin)$, where $in, fin \in \mathrm{ASSERTION}_{\mathsf{Px}}$ are the initial and final assertions.


Example 2. The annotation of the proof in Fig. 4 is given by ann, with the
mappings of thread 1 as shown below; the mappings of thread 2 are similar.

$ann = \{(1,1) \mapsto P_1,\ (1,2) \mapsto P_2,\ (1,3) \mapsto P_3,\ (1,6) \mapsto P_4, \ldots\}$

Additionally, we have $in \triangleq a = 0 \wedge \forall o \in \{x,y,z\}, \tau \in \{1,2\}.\ [o]_\tau = [o]^{\mathsf{P}} = \{0\}$,
$fin \triangleq [x]^{\mathsf{P}} = \{1\}$ and $I \triangleq [z]^{\mathsf{P}} = \{1\} \Rightarrow [x]^{\mathsf{P}} = \{1\}$.


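Definition 1 (Valid proof outline). A proof outline $(in, ann, I, fin)$ is valid
for a program $\Pi$ iff the following hold:


Initialisation. For all $\tau \in \mathrm{TID}$, $in \Rightarrow ann(\tau, \iota)$.
Finalisation. $\big(\bigwedge_{\tau \in \mathrm{TID}} ann(\tau, \zeta)\big) \Rightarrow fin$.
Local correctness. For all $\tau \in \mathrm{TID}$ and $i \in \mathrm{LAB}$, either:
- $\Pi(\tau, i) = \alpha\ \mathbf{goto}\ j$ and $\{ann(\tau, i)\}\ \alpha\ \{ann(\tau, j)\}$; or
- $\Pi(\tau, i) = \mathbf{if}\ B\ \mathbf{goto}\ j\ \mathbf{else\ to}\ k$ and both $ann(\tau, i) \wedge B \Rightarrow ann(\tau, j)$ and
  $ann(\tau, i) \wedge \neg B \Rightarrow ann(\tau, k)$ hold; or
- $\Pi(\tau, i) = \langle \alpha\ \mathbf{goto}\ j, \hat{a} := \hat{e}\rangle$ and $\{ann(\tau, i)\}\ \alpha\ \{ann(\tau, j)[\hat{e}/\hat{a}]\}$.
Stability. For all $\tau_1, \tau_2 \in \mathrm{TID}$ such that $\tau_1 \neq \tau_2$ and $i_1, i_2 \in \mathrm{LAB}$:
- if $\Pi(\tau_1, i_1) = \alpha\ \mathbf{goto}\ j$, then $\{ann(\tau_2, i_2) \wedge ann(\tau_1, i_1)\}\ \alpha\ \{ann(\tau_2, i_2)\}$;
- if $\Pi(\tau_1, i_1) = \langle \alpha\ \mathbf{goto}\ j, \hat{a} := \hat{e}\rangle$, then
  $\{ann(\tau_2, i_2) \wedge ann(\tau_1, i_1)\}\ \alpha\ \{ann(\tau_2, i_2)[\hat{e}/\hat{a}]\}$.
Persistence. There exists $\tau \in \mathrm{TID}$ such that for all $i \in \mathrm{LAB}$, $ann(\tau, i) \Rightarrow I$.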


Intuitively, Initialisation (resp. Finalisation) ensures that the initial (resp. final) 
assertion of each thread holds at the beginning (resp. end); Local correctness 
establishes annotation validity for each thread; Stability ensures that each (local) 
thread annotation is interference-free under the execution of other threads [30]; 
and Persistence ensures that the crash invariant holds at every program point 
for some thread. 
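
The Local correctness and Stability obligations of a proof outline can be enumerated mechanically. The Python sketch below is only an illustration of that enumeration: the dictionary encoding of programs, the function name `obligations`, and the numeric labels are assumptions made for this example and are not part of the PIEROGI development. It handles only the plain "α goto j" form and names assertions symbolically rather than checking them.

```python
from itertools import product

def obligations(prog, ann):
    """List the Local-correctness and Stability triples of a proof outline.

    prog maps (thread, label) -> (statement, next_label); only the plain
    'alpha goto j' form is modelled (no branches, no ghost updates).
    ann is the set of annotated program points (thread, label).
    Assertions are referred to by name, e.g. 'ann(1,2)'.
    """
    name = lambda t, i: f"ann({t},{i})"
    local, stability = [], []
    for (t, i), (stmt, j) in sorted(prog.items()):
        # Local correctness: {ann(t,i)} stmt {ann(t,j)}
        local.append((name(t, i), stmt, name(t, j)))
    for ((t1, i1), (stmt, _)), (t2, i2) in product(sorted(prog.items()), sorted(ann)):
        if t1 != t2:
            # Stability: {ann(t2,i2) /\ ann(t1,i1)} stmt {ann(t2,i2)}
            pre = f"{name(t2, i2)} /\\ {name(t1, i1)}"
            stability.append((pre, stmt, name(t2, i2)))
    return local, stability

# Hypothetical straight-line encoding of a two-thread program.
prog = {
    (1, 1): ("store x 1", 2),
    (1, 2): ("flush x", 3),
    (1, 3): ("store y 1", 4),
    (2, 1): ("a := load y", 2),
    (2, 2): ("store z 1", 3),
}
ann = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3)}
local, stability = obligations(prog, ann)
```

Each statement of one thread is paired with every annotated point of the other thread, which is why the number of Stability obligations grows with the product of the thread sizes.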


Example 3. Given the program in Example 1 and its annotation in Example 2,
both Initialisation and Finalisation clearly hold. Moreover, Persistence holds for
thread 1. For Local correctness of thread 1, we must prove (1)-(3) below; Local
correctness of thread 2 is similar.


$\{P_1\}$ store x 1 $\{P_2\}$    (1)

$\{P_2\}$ flush x $\{P_3\}$    (2)

$\{P_3\}$ store y 1 $\{P_4\}$    (3)

For Stability of $P_1$ (the precondition of store x 1 in thread 1) against thread 2
we must prove:

$\{P_1\}$ a := load y $\{P_1\}$    (4)

$\{P_1 \wedge a = 1\}$ store z 1 $\{P_1\}$    (5)


Stability of other assertions (i.e., $P_2$-$P_4$) is similar. We prove (1)-(5) in §3.4.


3.4 PIEROGI Proof rules 


One of the main benefits of PIEROGI is the ability to perform proofs at a high 
level of abstraction. In this section, we provide the set of proof rules that we use. 
The annotation within a proof outline is, in essence, an invariant mapping each 
program location to an assertion that holds at the program location. Thus, we 
prove local correctness by checking that each atomic step of a thread establishes 
the assertions in that thread. Similarly, we check stability by checking each 
assertion in one thread against each atomic step of the other threads. To enable 
proof abstraction, we introduce a set of proof rules that describe the interaction 
between the assertions from §3.2 and the atomic program steps. We will use 
the standard decomposition rules from Hoare Logic to reduce proof outlines and 
enable our rules over atomic steps to be applied. 

Standard Decomposition Rules The standard decomposition rules we use 
are given in Fig. 6, which allow one to weaken preconditions and strengthen 
postconditions, and decompose conjunctions and disjunctions. 

Rules for Atomic Statements and View-Based Assertions Weak and
persistent memory models (e.g. Px86) are inherently non-deterministic. Moreover,
in contrast to sequential consistency, in view-based operational semantics

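\[
\frac{P' \Rightarrow P \qquad \{P\}\ \Pi\ \{Q\} \qquad Q \Rightarrow Q'}{\{P'\}\ \Pi\ \{Q'\}}\ \textsc{Cons}
\qquad
\frac{\{P_1\}\ \Pi\ \{Q_1\} \qquad \{P_2\}\ \Pi\ \{Q_2\}}{\{P_1 \wedge P_2\}\ \Pi\ \{Q_1 \wedge Q_2\}}\ \textsc{Conj}
\qquad
\frac{\{P_1\}\ \Pi\ \{Q_1\} \qquad \{P_2\}\ \Pi\ \{Q_2\}}{\{P_1 \vee P_2\}\ \Pi\ \{Q_1 \vee Q_2\}}\ \textsc{Disj}
\]
Fig. 6: Standard decomposition rules of PIEROGI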
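Precondition | Statement | Postcondition | Const. | Ref.
$\{[x]_\tau = S\}$ | a := load x | $\{a \in S \wedge [x]_\tau \subseteq S\}$ | | LP1
$\{u \in [x]_\tau \Rightarrow \langle x, u\rangle[y]_\tau = S\}$ | a := load x | $\{a = u \Rightarrow [y]_\tau \subseteq S\}$ | | LP2
$\{|x, u| = 1 \wedge [\![x]\!]_{\tau'} \wedge [x]_{\tau'} = \{u\}\}$ | a := load x | $\{a = u \Rightarrow [x]_\tau = \{u\}\}$ | | LP3
$\{true\}$ | store x v | $\{[x]_\tau = \{v\}\}$ | | SP1
$\{[x]_{\tau'} = S\}$ | store x v | $\{[x]_{\tau'} = S \cup \{v\}\}$ | $\tau \neq \tau'$ | SP2
$\{[x]^{\mathsf{P}} = S\}$ | store x v | $\{[x]^{\mathsf{P}} = S \cup \{v\}\}$ | | SP3
$\{[x]^{\mathsf{A}}_{\tau'} = S\}$ | store x v | $\{[x]^{\mathsf{A}}_{\tau'} = S \cup \{v\}\}$ | | SP4
$\{[y]_\tau = S \wedge v \notin [x]_{\tau'}\}$ | store x v | $\{\langle x, v\rangle[y]_{\tau'} = S\}$ | $\tau \neq \tau'$ | SP5
$\{true\}$ | store x v | $\{[\![x]\!]_\tau \wedge [\![x]\!]^{\mathsf{F}}_\tau\}$ | | SP6
$\{|x, v| = n\}$ | store x v | $\{|x, v| = n + 1\}$ | | SP7
$\{[x]_\tau = S\}$ | flush x | $\{[x]^{\mathsf{P}} \subseteq S \wedge [x]^{\mathsf{A}}_\tau \subseteq S\}$ | | FP1
$\{[x]^{\mathsf{P}} = S\}$ | flush x | $\{[x]^{\mathsf{P}} \subseteq S\}$ | | FP2
$\{[\![x]\!]_{\tau'} \wedge [x]_{\tau'} = \{u\} \wedge [\![x]\!]^{\mathsf{F}}_\tau\}$ | flush x | $\{[x]^{\mathsf{P}} = \{u\}\}$ | | FP3
$\{[x]_\tau = S \vee [x]^{\mathsf{A}}_\tau = S\}$ | flushopt x | $\{[x]^{\mathsf{A}}_\tau \subseteq S\}$ | | OP
$\{[x]^{\mathsf{A}}_\tau = S \vee [x]^{\mathsf{P}} = S\}$ | sfence | $\{[x]^{\mathsf{P}} \subseteq S\}$ | | SFP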


Fig. 7: Selected proof rules for atomic statements executed by thread T 


(such as Px86view), instructions such as a := load x may have a side effect since
they may update the view of the thread performing the load (cf. [11]). Therefore,
unlike Hoare Logic, which contains a single rule for assignment, we have a
set of rules for atomic statements, describing their interaction with view-based
assertions. Each of the rules in this section has been proved sound with respect
to the view-based semantics encoded in Isabelle/HOL.

A selection of these rules for the atomic statements is given in Fig. 7, where
the statement is assumed to be executed by thread $\tau$. The first column contains
the pre/post condition triple, the second any additional constraints and the
third, labels that we use to refer to the rules in our descriptions below. Unless
explicitly mentioned as a constraint, we do not assume that threads, locations
and values are distinct; e.g. rule LP3 (referring to $\tau$ and $\tau'$) holds regardless of
whether $\tau = \tau'$ or not.

The rules in Fig. 7 provide high-level insights into the low-level semantics of
Px86view without having to understand the operational details. The LP$_i$ rules
are for the statement a := load x. Rule LP1 states that if $\tau$'s view of $x$ is the set
of values $S$, then in the post state $a$ is an element of $S$ and moreover $\tau$'s view
of $x$ is a subset of $S$ (since $\tau$'s view may have shifted). By LP2, provided the
conditional view of $\tau$ on $y$ (with condition $x = u$) is $S$, if the load returns value
$u$, then the view of $\tau$ is shifted so that $[y]_\tau \subseteq S$. We only have $[y]_\tau \subseteq S$ in the
postcondition because there may be multiple writes to $x$ with value $u$; reading
$u$ for $x$ may shift the view to the later write, thus reducing the set of values that
$\tau$ can read for $y$. LP3 describes conditions for a deterministic load by thread $\tau$.
The precondition assumes that there is only one write to $x$ with value $u$, and that
some thread $\tau'$ sees the last write to $x$ with value $u$. Then, if $\tau$ reads $u$, its view
of $x$ is also constrained to just the set containing $u$.

The store rules, SP$_i$, reflect the fact that a new write modifies the views of
the other threads as well as the persistent memory and asynchronous views. The
first four rules describe the interaction of a store by thread $\tau$ with current view
assertions. By SP1, the store ensures that the current view of $\tau$ is solely the
value $v$ written by $\tau$. This is because in Px86view, new writes are introduced by
the executing thread, $\tau$, with a maximal timestamp (see STORE rule in Fig. 12),
and $\tau$'s view is updated to this new write. SP2, SP3 and SP4 are similar, and
assuming that the view (of another thread, persistent memory and asynchronous
view, respectively) in the pre-state is $S$, show that the view in the post state
is $S \cup \{v\}$. Rule SP5 allows one to introduce a conditional observation assertion
$\langle x, v\rangle[y]_{\tau'}$, where $\tau' \neq \tau$. The pre-state of SP5 assumes that $\tau$'s view of $y$ is
the set $S$, and that $\tau'$ cannot view value $v$ for $x$. Rule SP6 introduces last-view
assertions for $\tau$ after $\tau$ performs a write to $x$, and finally SP7 states that the
number of writes to $x$ with value $v$ increases by 1 after executing store x v.

Rules FP$_i$ describe the effect of flush x on the state. FP1 states that, provided
that the current view of $\tau$ for $x$ is the set of values $S$, after executing flush x, we
are guaranteed that both the persistent view and asynchronous view of $\tau$ for $x$ are
subsets of $S$. We obtain a subset in the post state since the Px86view semantics
potentially moves the persistent and asynchronous views forward. Similarly, by
FP2, if the current persistent view of $x$ is $S$, then after executing flush x the
persistent view will be a subset of $S$. Finally, FP3 provides a mechanism for
establishing a deterministic persistent view $u$ for $x$. The precondition assumes
that some thread's view of $x$ is the last write with value $u$ and that $\tau$'s view is
such that the flush is guaranteed to flush this last write to $x$.

Rule OP describes how the asynchronous view of $\tau$ in the postcondition of
flushopt x is related to the current view of $\tau$ and the asynchronous view in the
precondition. Finally, rule SFP describes the relationship between the persistent
view in the postcondition and the asynchronous view and persistent view in the
precondition for an sfence instruction.
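
As an illustration of how OP and SFP chain together, the following derivation sketches the flushopt-then-sfence pattern for a thread $\tau$ whose view of $x$ is the single value 1. It is a worked instance under the reading of the rules given in Fig. 7, not a rule taken from the Isabelle/HOL development, and it assumes that views are never empty (as required by the well-formedness condition of the framework) and that no other thread interferes between the two steps.

```latex
\begin{align*}
&\{[x]_\tau = \{1\}\}\ \mathtt{flushopt}\ x\ \{[x]^{\mathsf{A}}_\tau \subseteq \{1\}\}
  && \text{(OP with } S = \{1\}\text{, first disjunct via Cons)} \\
&[x]^{\mathsf{A}}_\tau \subseteq \{1\} \wedge [x]^{\mathsf{A}}_\tau \neq \emptyset
  \;\Rightarrow\; [x]^{\mathsf{A}}_\tau = \{1\}
  && \text{(non-emptiness of views)} \\
&\{[x]^{\mathsf{A}}_\tau = \{1\}\}\ \mathtt{sfence}\ \{[x]^{\mathsf{P}} \subseteq \{1\}\}
  && \text{(SFP with } S = \{1\}\text{, first disjunct via Cons)} \\
&[x]^{\mathsf{P}} \subseteq \{1\} \wedge [x]^{\mathsf{P}} \neq \emptyset
  \;\Rightarrow\; [x]^{\mathsf{P}} = \{1\}
  && \text{(non-emptiness of views)}
\end{align*}
```

Read together, the four lines show why a flushopt alone does not yet establish a persistent view, while the subsequent sfence does.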

Our Isabelle/HOL development contains further rules for the other instructions,
including mfence and cas, which we omit here for space reasons. In
addition, we prove the stability of several assertions (see Fig. 8 for a selection).
An assertion $P$ is stable over a statement $\alpha$ executed by $\tau$ iff $\{P\}\ \alpha\ \{P\}$ holds.


Well-formedness The final major aspect of our framework is a well-formedness
condition that describes the set of reachable states in the Px86view semantics.
The condition is expressed as an invariant of the semantics: it holds initially, and
is stable under every possible transition of Px86view. In fact, the rules in Figs. 7
and 8 are proved with respect to this well-formedness condition.

The majority of the well-formedness constraints are straightforward, e.g. describing
the relationship between the views of different components. The most
important component of the well-formedness condition is a non-emptiness condition
on views, which states that $[x]_\tau \neq \emptyset \wedge [x]^{\mathsf{P}} \neq \emptyset \wedge [x]^{\mathsf{A}}_\tau \neq \emptyset$. For instance, a
consequence of this condition is that, in combination with LP1, we have:

[Fig. 8 table: for each atomic statement, the assertions that are stable under it. Rule labels: LS1-LS5 (a := load x), FS1-FS5 (flush x), SFS1-SFS2 (sfence), WS1-WS7 (store x v), OS1-OS3 (flushopt x). The stable assertions are current-view assertions of the forms $[y]_{\tau'} = S$, $[y]^{\mathsf{P}} = S$ and $[y]^{\mathsf{A}}_{\tau'} = S$, register assertions $a = k$, last-view assertions, and write-count assertions $|y, v'| = n$, subject to side conditions such as $\tau \neq \tau'$, $x \neq y$, or $x \neq y \vee v \neq v'$.]

Fig. 8: Selection of stable assertions for atomic statements executed by thread $\tau$



$\{[x]_\tau = \{v\}\}$ a := load x $\{[x]_\tau = \{v\}\}$    (6)


Worked Example We now return to the proof obligations from Example 3 and
demonstrate how they can be discharged using the proof rules described above.
For Local correctness, condition (1) holds by Conj (from Fig. 6) together with
stability rules WS1, WS2 and WS4 (from Fig. 8), which establish the first three
conjuncts in the postcondition, and SP1 from Fig. 7, which establishes the
final conjunct. Condition (2) holds by FP1 in Fig. 7 together with Cons (from
Fig. 6). Finally, condition (3) holds by WS2 (from Fig. 8).

Both the Stability conditions (4) and (5) from Example 3 hold by the stability
rules in Fig. 8 together with Cons and Conj (from Fig. 6). In particular, for (4),
we use rules LS1, LS2 and LS4, and for (5), we use WS1, WS2 and WS3.


4 Examples 


In this section we present a selection of programs that we have verified in
Isabelle/HOL. These examples highlight specific aspects of Px86, in particular, the
interaction between flushopt and sfence, as well as aspects of our view-based
assertion language that simplify verification.


Optimised Message Passing We start by considering a variant of Fig. 1e,
which contains two optimisations. First, we notice that flushing of the write to x
in thread 1 can be moved to thread 2 since the write to z is guarded by whether
or not thread 2 reads the flag y. Second, it is possible to replace the flush by a
more optimised flushopt followed by an sfence. We confirm correctness of these
optimisations via the proof outline in Fig. 9. The optimised message passing
in Fig. 9 ensures the same persistent invariant as Fig. 1e. However, the way in
which this is established differs. In particular, in Fig. 1e, the persistent invariant
holds due to thread 1, whereas in Fig. 9 it holds due to thread 2.

[Fig. 9 shows the annotated two-thread program. Initial assertion: $\forall o \in \{x,y,z\}, \tau \in \{1,2\}.\ [o]_\tau = [o]^{\mathsf{P}} = [o]^{\mathsf{A}}_\tau = \{0\}$. Crash invariant: $[z]^{\mathsf{P}} = \{1\} \Rightarrow [x]^{\mathsf{P}} = \{1\}$.]

Fig. 9: Proof outline for optimised message passing

With respect to the persistent invariant, the most important sequence of
steps takes place in thread 2 if it reads 1 for y. Note that by the conditional
view assertion in the precondition of a := load y, thread 2 is guaranteed to read
1 for x after reading 1 for y. Thus, if the test of the if statement succeeds, then
thread 2 must see 1 for x. This view is translated into an asynchronous view
after the flushopt is executed, and then to the persistent view after executing
sfence. Note that until this occurs, we can guarantee that $[z]^{\mathsf{P}} = \{0\}$, which
trivially guarantees the persistent invariant.


Flush Buffering Our next example is a variation of store buffering (SB) and is 
used to highlight how writes by different threads on different locations interact 
with flushes. Here, thread 1 writes to x and flushes y, while thread 2 writes to y 
then flushes x. The writes to w and z are used to witness whether the flushes in 
both threads have occurred. The persistent invariant states that, if both w and 
z hold 1 in persistent memory, then either x or y has the new value (i.e. 1) in 
persistent memory. If both threads perform their flush operations, then at least 
one must flush value 1 since a flush cannot be reordered with a store. 

Although simple to state, the proof is non-trivial since it requires careful 
analysis of the order in which the stores to x and y occur. In the semantics of 
Cho et al. [9], the flush corresponding to the second store instruction executed 
synchronises with writes to all locations. Thus, for example, if thread 1’s store to 
x is executed after thread 2’s store to y, then the subsequent flush in thread 1 
is guaranteed to flush the new write to y. 
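
The claim can be checked exhaustively on a small executable model. The Python sketch below is an illustration only: it assumes one reading of the STORE and FLUSH transitions (a store appends a message and updates the thread's coherence view; a flush raises the thread's commit view for the flushed location to its maximal coherence timestamp), it assumes that thread 1 writes w and thread 2 writes z, and all names (`run`, `persistent_view`, `violations`) are hypothetical. It enumerates every interleaving and every intermediate state, and counts states in which a crash could leave w = z = 1 while both x and y are still 0.

```python
from itertools import combinations

LOCS = ["x", "y", "w", "z"]

def run(schedule, prog):
    """Run one interleaving, yielding (memory, commit views) after every step."""
    mem = [(l, 0) for l in LOCS]                      # initial writes, value 0
    coh = {t: {l: i for i, l in enumerate(LOCS)} for t in prog}
    commit = {t: {l: 0 for l in LOCS} for t in prog}  # per-thread commit views
    pc = {t: 0 for t in prog}
    yield mem, commit
    for t in schedule:
        op, loc = prog[t][pc[t]]
        pc[t] += 1
        if op == "store":                             # append, update coherence view
            mem = mem + [(loc, 1)]
            coh[t][loc] = len(mem) - 1
        else:                                         # flush: commit view joins maxcoh
            commit[t][loc] = max(commit[t][loc], max(coh[t].values()))
        yield mem, commit

def persistent_view(mem, commit, loc):
    """Values loc may hold after a crash: writes not overwritten up to the commit bound."""
    bound = max(c[loc] for c in commit.values())
    return {v for t, (l, v) in enumerate(mem)
            if l == loc and all(mem[u][0] != loc for u in range(t + 1, bound + 1))}

PROG = {
    1: [("store", "x"), ("flush", "y"), ("store", "w")],
    2: [("store", "y"), ("flush", "x"), ("store", "z")],
}

def violations(prog=PROG):
    n1, total = len(prog[1]), len(prog[1]) + len(prog[2])
    bad = 0
    for ones in combinations(range(total), n1):
        schedule = [1 if i in ones else 2 for i in range(total)]
        for mem, commit in run(schedule, prog):
            view = {l: persistent_view(mem, commit, l) for l in LOCS}
            # A crash may pick a value for each location independently.
            if 1 in view["w"] and 1 in view["z"] and 0 in view["x"] and 0 in view["y"]:
                bad += 1
    return bad
```

Under these assumptions `violations()` finds no bad state, whereas moving thread 1's flush before its store makes the invariant fail, which matches the intuition that a flush cannot be reordered with an earlier store.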

6 Note that the flush operations here are analogous to the load instructions in SB.

Fig. 10: Proof outline for flush buffering


The above intuition requires reasoning about the order in which operations
occur. To facilitate this, we use auxiliary variables $\hat{a}$ and $\hat{b}$ to record the order
in which the writes to x and y occur; $\hat{a} = 1$ iff the write to x occurs before the
write to y, and $\hat{a} = 2$ iff the write to x occurs after the write to y. Let us now
consider the precondition of flush y (the reasoning for flush x is symmetric).
There are two disjuncts to consider.


- The first disjunct describes the case in which thread 1 executes its store
  before thread 2. From here, there is a danger that thread 1 can terminate
  having flushed 0 for y. However, from this state, thread 2 is guaranteed to
  flush 1 for x before setting z to 1, satisfying the persistent invariant, as
  described by the second disjunct of each assertion in thread 2.

- The second disjunct describes the case in which thread 1 executes its store
  after thread 2. In this case, thread 1 is guaranteed to flush 1 for y, and this
  fact is captured by the conjunct $[\![y]\!]_2 \wedge [y]_2 = \{1\} \wedge [\![y]\!]^{\mathsf{F}}_1$, which ensures that
  1) thread 2 sees the last write to y; 2) the only value visible for y to thread 2
  is 1; and 3) a flush performed by thread 1 is guaranteed to flush the last
  write to y. Note that by 1) and 2), we are guaranteed that the last write
  to y has value 1. We use these three facts to deduce that $[y]^{\mathsf{P}} = \{1\}$ in the
  second disjunct of the postcondition of flush y using rule FP3.


Epoch Persistency In our next example, we demonstrate how writes of dif- 
ferent threads on the same location interact with an optimised flush in the same 
location, as well as how the ordering of optimised flushes/loads alters the per- 
sistency behaviour. The crash invariant of Fig. 11 states that if z and y hold the 
value 1 in persistent memory then x has the value 2 in persistent memory. 

In order for thread 2 to read value 2 for x, the store of 2 at x must be
performed before the store of 1 and $[x]_2 = \{1,2\}$. Establishing the persistent
invariant for thread 2 requires reasoning about the view of thread 2 for address
x (i.e. $[x]_2$) after the execution of the instruction a := load x. Notice here that
a := load x is ordered with respect to the later flushopt x instruction. Consequently,
any impact of the execution of the load on $[x]_2$ will also affect $[x]^{\mathsf{A}}_2$.
Taking into account the ordering of the writes at the address x, we can conclude
that if thread 2 reads the value 2, it reads the value of the last write at x. This
is expressed with the assertion $[\![x]\!]_1$ in the precondition of a := load x, which
states that thread 1's view of x is the last write to x. By rule LP3, if a thread
$\tau'$'s view of an address x contains only the last write at this address, and the last
value written at this address appears only once in memory, then if a thread
$\tau$ reads this value at x, its view of x (i.e. $[x]_\tau$) is guaranteed to contain only the
last written value at x. Consequently, after reading value 2, thread 2's view of x
contains only the value 2 (i.e. $[x]_2 = \{2\}$). Execution of flushopt x ensures $[x]^{\mathsf{A}}_2 = \{2\}$
(by rule OP). As a result, in the case that the if statement succeeds, after the
execution of the sfence it is guaranteed that the value 2 is persisted at x (i.e.
$[x]^{\mathsf{P}} = \{2\}$). In the case that the if statement fails, $[y]^{\mathsf{P}} = \{0\}$ must hold, thus
the persistent invariant holds trivially.

[Fig. 11 shows the annotated two-thread program. Initial assertion: $(\forall \tau \in \{1,2\}, o \in \{x,y,z\}.\ [o]_\tau = [o]^{\mathsf{P}} = \{0\}) \wedge a = 0$. Crash invariant: $[z]^{\mathsf{P}} = \{1\} \wedge [y]^{\mathsf{P}} = \{1\} \Rightarrow [x]^{\mathsf{P}} = \{2\}$.]

Fig. 11: Proof outline for epoch persistency


5 PIEROGI Soundness 


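In this section we present the Px86view model from [9] (§5.1), formally interpret
our assertions as predicates on states of that model (§5.2), and establish the
soundness of the proposed reasoning technique (§5.3).

\[
\textsc{(Assign)}\ 
\frac{\alpha = a := e \quad v = T.\mathsf{regs}(e) \quad T' = T[\mathsf{regs}(a) \mapsto v]}
     {(T, M) \xrightarrow{\alpha} (T', M)}
\qquad
\textsc{(Store)}\ 
\frac{\alpha = \mathtt{store}\ x\ e \quad v = T.\mathsf{regs}(e) \quad M' = M \mathbin{+\!\!+} [\langle x := v\rangle] \quad T' = T[\mathsf{coh}(x) \mapsto |M|]}
     {(T, M) \xrightarrow{\alpha} (T', M')}
\]
\[
\textsc{(Load-Internal)}\ 
\frac{\alpha = a := \mathtt{load}\ x \quad M[t] = \langle x := v\rangle \quad T.\mathsf{coh}(x) = t \quad T' = T[\mathsf{regs}(a) \mapsto v]}
     {(T, M) \xrightarrow{\alpha} (T', M)}
\]
\[
\textsc{(Load-External)}\ 
\frac{\begin{array}{c}
\alpha = a := \mathtt{load}\ x \quad M[t] = \langle x := v\rangle \quad T.\mathsf{coh}(x) < t \quad x \notin M(t..T.v_{\mathsf{rNew}}] \\
T' = T[\mathsf{regs}(a) \mapsto v,\ \mathsf{coh}(x) \mapsto t,\ v_{\mathsf{rNew}} \mapsto_{\sqcup} t,\ v_{\mathsf{pReady}} \mapsto_{\sqcup} t]
\end{array}}
     {(T, M) \xrightarrow{\alpha} (T', M)}
\]
\[
\textsc{(Sfence)}\ 
\frac{\alpha = \mathtt{sfence} \quad T' = T[v_{\mathsf{pReady}} \mapsto_{\sqcup} T.\mathsf{maxcoh},\ v_{\mathsf{pCommit}} \mapsto_{\sqcup} T.v_{\mathsf{pAsync}}]}
     {(T, M) \xrightarrow{\alpha} (T', M)}
\]
\[
\textsc{(Flush)}\ 
\frac{\alpha = \mathtt{flush}\ x \quad T' = T[v_{\mathsf{pAsync}}(x) \mapsto_{\sqcup} T.\mathsf{maxcoh},\ v_{\mathsf{pCommit}}(x) \mapsto_{\sqcup} T.\mathsf{maxcoh}]}
     {(T, M) \xrightarrow{\alpha} (T', M)}
\qquad
\textsc{(Flushopt)}\ 
\frac{\alpha = \mathtt{flushopt}\ x \quad T' = T[v_{\mathsf{pAsync}}(x) \mapsto_{\sqcup} T.\mathsf{coh}(x) \sqcup T.v_{\mathsf{pReady}}]}
     {(T, M) \xrightarrow{\alpha} (T', M)}
\]
\[
\textsc{(Program-Normal)}\ 
\frac{\begin{array}{c}
pc(\tau) = i \quad \Pi(\tau, i) = \alpha\ \mathbf{goto}\ j \quad (\mathbb{T}(\tau), M) \xrightarrow{\alpha} (T', M') \\
pc' = pc[\tau \mapsto j] \quad \mathbb{T}' = \mathbb{T}[\tau \mapsto T']
\end{array}}
     {\langle pc, \mathbb{T}, M, G\rangle \Rightarrow_{\Pi} \langle pc', \mathbb{T}', M', G\rangle}
\]
\[
\textsc{(Program-If)}\ 
\frac{\begin{array}{c}
pc(\tau) = i \quad \Pi(\tau, i) = \mathbf{if}\ B\ \mathbf{goto}\ j\ \mathbf{else\ to}\ k \\
pc' = pc\left[\tau \mapsto \begin{cases} j & \mathbb{T}(\tau).\mathsf{regs}(B) = true \\ k & \mathbb{T}(\tau).\mathsf{regs}(B) = false \end{cases}\right]
\end{array}}
     {\langle pc, \mathbb{T}, M, G\rangle \Rightarrow_{\Pi} \langle pc', \mathbb{T}, M, G\rangle}
\]
\[
\textsc{(Program-Ghost)}\ 
\frac{\begin{array}{c}
pc(\tau) = i \quad \Pi(\tau, i) = \langle \alpha\ \mathbf{goto}\ j, \hat{a} := \hat{e}\rangle \quad (\mathbb{T}(\tau), M) \xrightarrow{\alpha} (T', M') \\
pc' = pc[\tau \mapsto j] \quad \mathbb{T}' = \mathbb{T}[\tau \mapsto T'] \quad G' = G[\hat{a} \mapsto G(\hat{e})]
\end{array}}
     {\langle pc, \mathbb{T}, M, G\rangle \Rightarrow_{\Pi} \langle pc', \mathbb{T}', M', G'\rangle}
\]

Fig. 12: Transitions of Px86view for a program $\Pi$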


5.1 The Px86view Model


Like previous view-based models, Px86view employs a non-standard memory capturing
all previously executed writes, alongside so-called “thread views” that
track several position(s) of each thread in that history and enforce limitations
on the ability of the thread to read from and write to the memory. In addition,
the thread views contain the necessary information for determining the possible
contents of the non-volatile memory upon a system crash. Formally, Px86view's
memory and thread states are defined as follows.


Definition 2 (Px86view's memory). A memory $M \in \mathrm{MEMORY}$ is a list of
messages, where each message has the form $\langle x := v\rangle$ for some $x \in \mathrm{LOC}$ and
$v \in \mathrm{VAL}$. We use $w.\mathsf{loc}$ and $w.\mathsf{val}$ to refer to the two components of a message
$w$. We use standard list notations for memories (e.g. $M_1 \mathbin{+\!\!+} M_2$ for appending
memories, $[w]$ for a singleton memory, and $|M|$ for the length of $M$). We refer
to indices (starting from 0) in a memory $M$ as timestamps, and denote the $t$'th
element of $M$ as $M[t]$. We use $\sqcup$ for obtaining the maximum among timestamps
(i.e. $t_1 \sqcup t_2 = \max(t_1, t_2)$), and extend this notation pointwise to functions. We
write $x \notin M(t_2..t_1]$ for the condition $\forall t_2 < t \le t_1.\ M[t].\mathsf{loc} \neq x$.


Definition 3 (Px86view's thread states). A thread state $T \in \mathrm{THREAD}$ is a
record consisting of the following fields: $\mathsf{coh} : \mathrm{LOC} \to \mathbb{N}$, $v_{\mathsf{rNew}} : \mathbb{N}$, $v_{\mathsf{pReady}} : \mathbb{N}$,
$v_{\mathsf{pAsync}} : \mathrm{LOC} \to \mathbb{N}$, and $v_{\mathsf{pCommit}} : \mathrm{LOC} \to \mathbb{N}$. We use standard function/record
update notation (e.g. $T' = T[\mathsf{coh}(x) \mapsto t]$ denotes the thread state obtained from
$T$ by modifying the $x$ entry in the $\mathsf{coh}$ component of $T$ to $t$). In addition, $\mapsto_{\sqcup}$
is used to incorporate certain timestamps in fields (e.g. $T[v_{\mathsf{rNew}} \mapsto_{\sqcup} t]$ denotes
the thread state obtained from $T$ by modifying the $v_{\mathsf{rNew}}$ component of $T$ to
$T.v_{\mathsf{rNew}} \sqcup t$). We denote by $T.\mathsf{maxcoh}$ the maximum among the coherence view
timestamps ($T.\mathsf{maxcoh} \triangleq \bigsqcup_x T.\mathsf{coh}(x)$).


The two components, together with program counters and the “ghost memory”,
are combined in Px86view's machine states as defined next.


Definition 4 (Px86view's machine states). A machine state is a tuple
$\sigma = \langle pc, \mathbb{T}, M, G\rangle$ where $pc : \mathrm{TID} \to \mathrm{LAB}$ is a mapping assigning the next program
label to be executed by each thread, $\mathbb{T} : \mathrm{TID} \to \mathrm{THREAD}$ is a mapping assigning
the current thread state to each thread, $M \in \mathrm{MEMORY}$ is the current memory,
and $G : \mathrm{AUXVAR} \to \mathrm{VAL}$ is storing the current values of the auxiliary variables.
Below we assume that $G$ is extended to expressions $\hat{e} \in \mathrm{AUXEXP}$ in a standard
way. We denote the components of a machine state $\sigma$ by $\sigma.pc$, $\sigma.\mathbb{T}$, $\sigma.M$, and $\sigma.G$.
In addition, we denote by $\sigma.\mathsf{maxpCommit}(x)$ the maximum among the persistency
view timestamps for location $x$ ($\sigma.\mathsf{maxpCommit}(x) \triangleq \bigsqcup_\tau \sigma.\mathbb{T}(\tau).v_{\mathsf{pCommit}}(x)$).


The transitions of Px86view are presented in Fig. 12. These closely follow
the model in [9] with minor presentational simplifications. Note, however, that,
for simplicity and following [23], we conservatively assume that writes persist
atomically at the location granularity (representing, e.g. machine words) rather
than at the granularity of the width of a cache line. We refer the interested
reader to [9] for a detailed discussion of the transition rules in Fig. 12.
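
To make the persistence-related transitions concrete, the following Python sketch models a thread state as a small record and implements store, flush, flushopt and sfence as one reading of the corresponding rules in Fig. 12. It is an illustrative approximation, not the Isabelle/HOL encoding: the class `Thread`, the helper `join`, and the use of Python dictionaries for views are assumptions of this example, and registers and loads are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    coh: dict                                        # Loc -> timestamp
    v_rnew: int = 0
    v_pready: int = 0
    v_pasync: dict = field(default_factory=dict)     # Loc -> timestamp
    v_pcommit: dict = field(default_factory=dict)    # Loc -> timestamp

    @property
    def maxcoh(self):
        return max(self.coh.values())

def join(view, loc, t):
    """Raise view[loc] to at least t (the join-update on a view entry)."""
    view[loc] = max(view.get(loc, 0), t)

def store(T, M, loc, val):
    M.append((loc, val))              # new message gets the maximal timestamp
    T.coh[loc] = len(M) - 1

def flush(T, loc):
    join(T.v_pasync, loc, T.maxcoh)
    join(T.v_pcommit, loc, T.maxcoh)

def flushopt(T, loc):
    join(T.v_pasync, loc, max(T.coh[loc], T.v_pready))

def sfence(T):
    T.v_pready = max(T.v_pready, T.maxcoh)
    for loc, t in T.v_pasync.items():  # commit everything flushed asynchronously
        join(T.v_pcommit, loc, t)

# Usage: a flushopt alone does not advance the commit view; the sfence does.
M = [("x", 0)]
T = Thread(coh={"x": 0})
store(T, M, "x", 1)
flushopt(T, "x")
before = T.v_pcommit.get("x", 0)
sfence(T)
after = T.v_pcommit["x"]
```

In this run the commit view for x stays at the initial write until the sfence, after which it covers the write of 1, mirroring the two-step persistence of optimised flushes.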

The above operational definitions naturally induce a notion of an execution
(or a “run”) of Px86view on a certain program $\Pi$ starting from some initial state
of the form $\langle \lambda\tau.\,\iota, \mathbb{T}, M, G\rangle$. A system crash might occur at any point during the
execution. Again, following the model of [9], the non-volatile memory (NVM)
is not modeled as a concrete part of the state. Instead, the possible contents of
the NVM can be inferred from the machine state (specifically from the memory
and the $v_{\mathsf{pCommit}}$ views of the different threads), as defined next. This definition
is presented as “crash transition” in [9].


Definition 5. A non-volatile memory $NVM : \mathrm{LOC} \to \mathrm{VAL}$ is possible in a state
$\sigma$ if for every $x \in \mathrm{LOC}$, there exists some $t$ such that $\sigma.M[t] = \langle x := NVM(x)\rangle$
and $x \notin \sigma.M(t..\sigma.\mathsf{maxpCommit}(x)]$.
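
The definition can be read as a simple computation over the message list. The Python sketch below is an illustration under assumed encodings (a memory is a list of `(location, value)` pairs and `max_commit` is a dictionary giving the maximal commit timestamp per location); the function names are hypothetical.

```python
from itertools import product

def nvm_values(M, bound, loc):
    """Values v such that some write (loc := v) at timestamp t is not
    overwritten by another write to loc in the interval (t..bound]."""
    return {v for t, (l, v) in enumerate(M)
            if l == loc and all(M[u][0] != loc for u in range(t + 1, bound + 1))}

def possible_nvms(M, max_commit):
    """All non-volatile memories possible for memory M and commit bounds."""
    locs = sorted({l for l, _ in M})
    choices = [sorted(nvm_values(M, max_commit[l], l)) for l in locs]
    return [dict(zip(locs, c)) for c in product(*choices)]

# x has been committed up to its second write; y has not been committed at all.
M = [("x", 0), ("y", 0), ("x", 1), ("y", 1)]
nvms = possible_nvms(M, {"x": 2, "y": 0})
```

Here x can only hold 1 after a crash, while y may hold either 0 or 1, since writes beyond the commit bound may or may not have persisted.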




5.2 The Semantics of PIEROGI Assertions 


We present the formal definitions of the expressions introduced in §3.2 in terms
of Px86view's machine states.


Current and conditional views When formalising the current and conditional
view expressions, we start with auxiliary functions that return the sets of
observable timestamps visible to the components in question, then extract the
values in memory corresponding to these timestamps. To facilitate this, we define

$Vals(M, TS) \triangleq \{M[t].\mathsf{val} \mid t \in TS\}$

where $M \in \mathrm{MEMORY}$ and $TS$ is a set of timestamps.


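Thread view To define the meaning of the thread view expression, $[x]_\tau$, we use:

$TS^{OF}_\tau(\sigma, x, t) \triangleq \{t' \mid \sigma.M[t'].\mathsf{loc} = x \wedge \sigma.\mathbb{T}(\tau).\mathsf{coh}(x) \le t' \wedge x \notin \sigma.M(t'..t]\}$
$TS_\tau(\sigma, x) \triangleq TS^{OF}_\tau(\sigma, x, \sigma.\mathbb{T}(\tau).v_{\mathsf{rNew}})$

$TS^{OF}_\tau(\sigma, x, t)$ returns the set of timestamps that are observable from timestamp
$t$ for thread $\tau$ to read for location $x$ in state $\sigma$; and $TS_\tau(\sigma, x)$ returns the
set of timestamps that are observable for $\tau$ to read $x$ in $\sigma$. Note that after
instantiating $t$ to $\sigma.\mathbb{T}(\tau).v_{\mathsf{rNew}}$ in $TS^{OF}_\tau(\sigma, x, t)$, we obtain the premises of the load
rules in Fig. 12. Then, $[x]_\tau \triangleq \lambda\sigma.\ Vals(\sigma.M, TS_\tau(\sigma, x))$, i.e. the set of values in
$\sigma.M$ corresponding to the timestamps in $TS_\tau(\sigma, x)$.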
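
The thread view can be computed directly from these definitions. The Python sketch below assumes a memory encoded as a list of `(location, value)` pairs and passes the relevant thread-state fields as plain arguments; the names `ts_of` and `thread_view` are hypothetical and the code is an illustration rather than the formal definition.

```python
def ts_of(M, coh_x, loc, t):
    """Timestamps t' of writes to loc with coh_x <= t' and no write to loc
    in the interval (t'..t]."""
    return {tp for tp, (l, _) in enumerate(M)
            if l == loc and coh_x <= tp
            and all(M[u][0] != loc for u in range(tp + 1, t + 1))}

def thread_view(M, coh_x, v_rnew, loc):
    """Values the thread may read for loc, given its coherence view on loc
    and its v_rNew timestamp."""
    return {M[tp][1] for tp in ts_of(M, coh_x, loc, v_rnew)}

M = [("x", 0), ("y", 0), ("x", 1), ("x", 2)]
all_values = thread_view(M, 0, 0, "x")      # nothing observed yet
after_sync = thread_view(M, 0, 2, "x")      # v_rNew has advanced to timestamp 2
latest_only = thread_view(M, 3, 0, "x")     # coherence view is the last write
```

Advancing either the coherence view or `v_rnew` shrinks the set of readable values, which is the view "shift" referred to by the load rules.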

Persistent memory view For the persistent memory view expression, $[x]^{\mathsf{P}}$,
we use:

$TS^{\mathsf{P}}(\sigma, x) \triangleq \{t \mid \sigma.M[t].\mathsf{loc} = x \wedge x \notin \sigma.M(t..\sigma.\mathsf{maxpCommit}(x)]\}$

which returns the set of timestamps that are observable to the persistent memory
for $x$ in $\sigma$. Then, $[x]^{\mathsf{P}} \triangleq \lambda\sigma.\ Vals(\sigma.M, TS^{\mathsf{P}}(\sigma, x))$. Note that the second conjunct
within the definition of $TS^{\mathsf{P}}(\sigma, x)$ is precisely the condition that links Px86view
states to NVM states (Definition 5). Given this definition, we have:


Proposition 1. A non-volatile memory $NVM : \mathrm{LOC} \to \mathrm{VAL}$ is possible in a
state $\sigma$ iff $NVM(x) \in [x]^{\mathsf{P}}(\sigma)$ for every $x \in \mathrm{LOC}$.


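Asynchronous memory view To define the meaning of the asynchronous
memory view, $[x]^{\mathsf{A}}_\tau$, we use:

$TS^{\mathsf{A}}_\tau(\sigma, x) \triangleq \{t \mid \sigma.M[t].\mathsf{loc} = x \wedge x \notin \sigma.M(t..\sigma.\mathbb{T}(\tau).v_{\mathsf{pAsync}}(x)]\}$

which returns the timestamps of the asynchronous view of thread $\tau$ in location
$x$ and state $\sigma$. Then, as before, $[x]^{\mathsf{A}}_\tau \triangleq \lambda\sigma.\ Vals(\sigma.M, TS^{\mathsf{A}}_\tau(\sigma, x))$.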


Conditional view The functions used to define the conditional memory view,
$\langle x, v\rangle[y]_\tau$, are slightly more sophisticated than those above. We define:

$TS^{OV}_\tau(\sigma, x, v) \triangleq \{t' \mid \exists t \in TS_\tau(\sigma, x).\ \sigma.M[t].\mathsf{val} = v \wedge t' = (\text{if } t = \sigma.\mathbb{T}(\tau).\mathsf{coh}(x) \text{ then } \sigma.\mathbb{T}(\tau).v_{\mathsf{rNew}} \text{ else } t \sqcup \sigma.\mathbb{T}(\tau).v_{\mathsf{rNew}})\}$

$TS^{CV}_\tau(\sigma, x, v, y) \triangleq \bigcup \{TS^{OF}_\tau(\sigma, y, t') \mid t' \in TS^{OV}_\tau(\sigma, x, v)\}$

where $TS^{OV}_\tau(\sigma, x, v)$ returns the set of timestamps that $\tau$ can observe for $x$ with
value $v$. Assuming $t$ is a timestamp that $\tau$ can observe for $x$, and the value for $x$ at
$t$ is $v$, the corresponding timestamp $t'$ that $TS^{OV}_\tau(\sigma, x, v)$ returns is $\sigma.\mathbb{T}(\tau).v_{\mathsf{rNew}}$ if
$\tau$'s coherence view for $x$ is $t$, and the maximum of $t$ and $\sigma.\mathbb{T}(\tau).v_{\mathsf{rNew}}$ otherwise.
Given this, $TS^{CV}_\tau(\sigma, x, v, y)$ returns the timestamps that $\tau$ can observe for $y$, from
any timestamp $t' \in TS^{OV}_\tau(\sigma, x, v)$. Finally, the set of conditional values is defined
by $\langle x, v\rangle[y]_\tau \triangleq \lambda\sigma.\ Vals(\sigma.M, TS^{CV}_\tau(\sigma, x, v, y))$.

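
The conditional view can be prototyped in the same style. The Python sketch below is an illustration under assumed encodings (a memory is a list of `(location, value)` pairs, `coh` is a dictionary from locations to timestamps, and `v_rnew` is an integer); the function names are hypothetical. The example state corresponds to message passing: a writer has stored 1 to x and then 1 to y, and the reader has not yet observed either write.

```python
def ts_of(M, coh_x, loc, t):
    """Timestamps t' of writes to loc with coh_x <= t' and no write to loc
    in the interval (t'..t]."""
    return {tp for tp, (l, _) in enumerate(M)
            if l == loc and coh_x <= tp
            and all(M[u][0] != loc for u in range(tp + 1, t + 1))}

def cond_view(M, coh, v_rnew, x, v, y):
    """Values the thread may read for y after it reads value v for x."""
    values = set()
    for t in ts_of(M, coh[x], x, v_rnew):
        if M[t][1] != v:
            continue
        # Re-reading the write already held leaves v_rNew unchanged;
        # otherwise v_rNew is raised to at least t.
        t_new = v_rnew if t == coh[x] else max(t, v_rnew)
        values |= {M[tp][1] for tp in ts_of(M, coh[y], y, t_new)}
    return values

M = [("x", 0), ("y", 0), ("x", 1), ("y", 1)]
reader_coh = {"x": 0, "y": 1}
plain = {M[tp][1] for tp in ts_of(M, reader_coh["x"], "x", 0)}
after_flag = cond_view(M, reader_coh, 0, "y", 1, "x")
```

Before reading the flag the reader may see either 0 or 1 for x, but the view conditional on reading 1 for y contains only 1, which is the guarantee exploited in the message-passing proofs.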
Last view assertions We use the following auxiliary definition:

$Last(M, x) \triangleq \bigsqcup \{t \mid M[t].\mathsf{loc} = x\}$

which returns the timestamp of the last write to $x$ in $M$. Then, the last view
assertions are given by:


- $[\![x]\!]_\tau \triangleq \{\sigma \mid TS_\tau(\sigma, x) = \{Last(\sigma.M, x)\}\}$, i.e. $\tau$'s view of $x$ in $\sigma$ is the last
  write to $x$ in $\sigma$.

- $[\![x]\!]^{\mathsf{F}}_\tau \triangleq \{\sigma \mid Last(\sigma.M, x) \le \sigma.\mathbb{T}(\tau).\mathsf{maxcoh} \sqcup \sigma.\mathsf{maxpCommit}(x)\}$, i.e. the
  maximum of $\tau$'s maximum coherence view and the maximum commit view of $x$
  (over all threads) is beyond the last write to $x$ in $\sigma$. This means that
  executing a flush x operation in $\tau$ will cause the last write of $x$ to be flushed
  (see FLUSH rule in Fig. 12).


Value count Finally, the value count expression is defined as follows:
$|x, v| \triangleq \lambda\sigma.\ |\{t \mid \sigma.M[t] = \langle x := v\rangle\}|$
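
The last-view and value-count definitions translate into a few lines of code. The Python sketch below is illustrative only; it assumes a memory encoded as a list of `(location, value)` pairs, takes the set of observable timestamps and the relevant view timestamps as plain arguments, and uses hypothetical function names.

```python
def last(M, loc):
    """Timestamp of the last write to loc in M."""
    return max(t for t, (l, _) in enumerate(M) if l == loc)

def write_count(M, loc, val):
    """Number of writes to loc with value val."""
    return sum(1 for msg in M if msg == (loc, val))

def last_view(ts_tau, M, loc):
    """Holds iff the thread's observable timestamps for loc are exactly
    the last write to loc."""
    return ts_tau == {last(M, loc)}

def last_flush(M, maxcoh, max_commit, loc):
    """Holds iff a flush of loc by a thread with the given maxcoh is
    guaranteed to flush the last write to loc."""
    return last(M, loc) <= max(maxcoh, max_commit)

M = [("x", 0), ("y", 0), ("x", 1), ("y", 1)]
```

For this memory, a thread whose maximal coherence timestamp is 3 satisfies the flush-last condition for x, whereas a thread still at timestamp 1 does not.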


5.3 Soundness of PIEROGI 


Given the above building blocks, the soundness of the proposed reasoning tech- 
nique is stated as follows. 


Theorem 1 (Soundness of PIEROGI). Suppose that a program $\Pi$ has a
valid proof outline $(in, ann, I, fin)$. Let $\sigma$ be a state of Px86view that is reachable
in an execution of $\Pi$ from some state $\sigma_{init}$ of the form $\langle \lambda\tau.\,\iota, \mathbb{T}_{init}, M_{init}, G_{init}\rangle$
such that $\sigma_{init} \in in$. Then, the following hold:


1) For every $\tau \in \mathrm{TID}$, we have that $\sigma \in ann(\tau, \sigma.pc(\tau))$.

2) If $\sigma.pc(\tau) = \zeta$ for every $\tau \in \mathrm{TID}$, then $\sigma \in fin$.

3) Every non-volatile memory $NVM$ that is possible in $\sigma$ satisfies the crash
invariant $I$.


Finally, it is straightforward to show the soundness of a standard “auxiliary
variable transformation” [30] which removes all auxiliary variables from a program
$\Pi$ (translating each command $\langle \alpha\ \mathbf{goto}\ j, \hat{a} := \hat{e}\rangle$ into $\alpha\ \mathbf{goto}\ j$), provided
that the crash invariant and the final assertion do not contain occurrences of
the auxiliary variables. Indeed, it is easy to see that the auxiliary memory $G$ in
the operational semantics in Fig. 12 serves only as an instrumentation, and does
not restrict the possible runs. (Formally, if $\Pi'$ is obtained from $\Pi$ by removing
all auxiliary variables and $\langle pc, \mathbb{T}, M, G'\rangle$ is reachable in $\Rightarrow_{\Pi'}$ from some initial
state, then $\langle pc, \mathbb{T}, M, G\rangle$ is reachable in $\Rightarrow_{\Pi}$ from the same state for some $G$.)


6 Mechanisation 


Perhaps the greatest strength of our development is an integrated Isabelle/HOL 
mechanisation providing a fully fledged semi-automated verification tool for 
Px86$_{\mathrm{view}}$ programs. This mechanisation builds on the existing work on Owicki–Gries
for RC11 by Dalvandi et al. [11,12], applying it to the Px86$_{\mathrm{view}}$ semantics.
We start by encoding the operational semantics of Cho et al. [9], followed by the 
view-based assertions described in §3.2. Then, we prove correctness of all of the 
proof rules for the atomic statements, including those described in §3.4. These 
rules can be challenging to prove since they require unfolding of the assertions 
and examination of the low-level operational semantics and their effect on the 
views of different system components. 

Once proved, the rules provided are highly reusable, and are key to making 
verification feasible. Specifically, when showing the validity of a proof outline 
(Definition 1), Isabelle/HOL generates the necessary proof obligations (after mi- 
nor interactions) and automatically finds the set of high-level proof rules needed 
to discharge each proof obligation via the built-in sledgehammer tool [6]. This 
enables a high degree of experimentation and debugging of proof outlines, includ- 
ing the ability to reduce assertion complexity once a proof outline is validated. 

The base development (semantics, view-based assertions, and soundness of
proof rules) comprises ~7000 lines of Isabelle/HOL code. With this base development
in place, each example comprises 200–400 lines of code (including the
encoding of the program, the annotations, and the proofs of validity). The entire 
development took approximately 3 months of full-time work. 


7 Related Work 


The soundness of PIEROGI is proven relative to the Px86$_{\mathrm{view}}$ model of Cho et al. [9];
there are however other equivalent models in the literature [1, 23, 32,34], as well 
as other persistency models [33,35]. While the original persistent x86 semantics 
has asynchronous explicit persist instructions [34], the underlying model assumed 
here is due to Cho et al. [9] with synchronous persist instructions. Nevertheless, 
Khyzha and Lahav [23] formally proved that the two alternatives are equivalent 
when reasoning about states after crashes (e.g. using our “crash invariants”). 
As mentioned in §1, the only existing program logic for persistent programs 
is POG [31], which (as with PIEROGI) is a descendent of Owicki-Gries [30]. 
PIEROGI goes beyond POG by handling examples that involve flushopt instruc- 
tions, which cannot be directly verified using POG. Raad et al. [31] provide a 
transformation technique to replace certain patterns of flushopt and sfence with 
flush. Specifically, given a program $\Pi$ that includes flushopt instructions, provided
that $\Pi$ meets certain conditions, this transformation mechanism rewrites
$\Pi$ into an equivalent program $\Pi'$ that uses flush instructions instead, allowing
one to use POG. However, there are three limitations to this strategy: 1) the
rewriting is an external mechanism that requires stepping outside the POG logic;
2) the rewriting is potentially expensive and must be done for every program
that includes flushopt; and 3) the transformation technique is incomplete in that
not all programs meet the stipulated conditions (e.g. Epoch Persistency 2), and
thus cannot be verified using this technique. PIEROGI has no such limitations, as
we showed in the examples in Section 4. Moreover, POG has no corresponding 
mechanisation, and developing a mechanisation that also efficiently handles the 
program transformation for flushopt instructions would be non-trivial. 


The Owicki–Gries method was first applied to non-SC memory consistency by
Lahav et al. [26]. One way that their approach, which targets the release/acquire 
memory model, is different from ours is that they aim to use standard SC-like 
assertions; in order to retain soundness under a weak memory model, they had 
to strengthen the standard stability conditions on proof outlines. Dalvandi et 
al. [11,13] took a different approach when designing their Owicki–Gries logic
for the release/acquire fragment of C11: by employing a more expressive, view- 
based assertion language, they were able to stick with the standard stability 
requirement. In our work, we follow Dalvandi et al.’s approach. However, our 
assertions are fine-tuned to cope with the other types of view present in Px86$_{\mathrm{view}}$,
such as those corresponding to the persistent and the asynchronous views. It is 
interesting that some of the principles of view-based reasoning apply to different 
memory models, and future work could look at unifying reasoning across models. 


Dalvandi et al. [13] have developed a deeper integration of their view-based 
logic using the Owicki-Gries encoding of Nipkow and Prensa Nieto [28] in Is- 
abelle/HOL. Such an integration would be straightforward for PIEROGI too, 
allowing verification to take place without translating programs into a transi- 
tion system. This would be much more difficult for POG since Owicki-Gries rules 
themselves are different from the standard encoding in Isabelle/HOL, in addition 
to the transformation required for flushopt instructions discussed above.


The idea of extending Hoare triples with crash conditions first appeared in 
the work of Chen et al. [8]. However, that work supports neither concurrency 
nor explicit flushing instructions. Related ideas are found in the works of Ntzik 
et al. [29] and Chajed et al. [7]. However, in contrast to PIEROGI, both of these 
works 1) assume sequentially consistent memory, as opposed to a weak memory 
model such as TSO; 2) assume strict persistency (where store and persist orders 
coincide); and 3) assume there is a synchronous flush operation, which is easier 
to reason about than the asynchronous flushopt operation. 


Besides program logics, there have been other recent efforts to help program- 
mers reason about persistent programs. For instance, Abdulla et al. [1] have 
proven that state-reachability for persistent x86 is decidable, thus opening the 
door to automatic verification of persistent programs, and Gorjiara et al. [18] 
and Kokologiannakis et al. [25] have developed model checkers for finding bugs 
in persistent programs. Recent works have considered durable atomic objects 
such as concurrent data structures [17] and transactional memory [3] and their 
verification [3, 14,15], which have been designed to satisfy conditions such as 
durable linearizability [20,24] and durable opacity [3]. These proofs assume per- 
sistency under SC; our work provides foundations for extending these proofs to 
persistent x86-TSO. 
Abstract. We study abstraction for crash-resilient concurrent objects
using non-volatile memory (NVM). We develop a library-correctness 
criterion that is sound for ensuring contextual refinement in this set- 
ting, thus allowing clients to reason about library behaviors in terms 
of their abstract specifications, and library developers to verify their 
implementations against the specifications abstracting away from par- 
ticular client programs. As a semantic foundation we employ a recent 
NVM model, called Persistent Sequential Consistency, and extend its 
language and operational semantics with useful specification constructs. 
The proposed correctness criterion accounts for NVM-related interac- 
tions between client and library code due to explicit persist instructions, 
and for calling policies enforced by libraries. We illustrate our approach 
on two implementations and specifications of simple persistent objects 
with different prototypical durability guarantees. Our results provide the 
first approach to formal compositional reasoning under NVM. 


Keywords: Non-volatile memory - Linearizability - Library abstraction 


1 Introduction 


Non-volatile memory, or NVM for short, is an emerging technology that enables
byte-addressable, high-performance storage together with data persistence
across system crashes. This combination of features allows researchers and practitioners
to develop a variety of efficient crash-resilient data structures (see, e.g.,
[14,32]). Recently, NVM has started to become available in commodity architec- 
tures of manufacturers such as Intel and ARM [4,23], and formal (operational 
and declarative) models of these systems have been proposed [10, 25, 30]. 
Unfortunately, like other new technologies, NVM puts more burden on pro- 
grammers. Indeed, to get close to the performance of DRAM, writes to the NVM 
are first kept in volatile (i.e., losing contents upon crashes) caches, and only later 
persist (i.e., propagate to the NVM), possibly not in the order in which they were 
issued. This results in counterintuitive behaviors even for sequential programs
and requires careful management using barriers of different kinds, a.k.a. explicit 
persist instructions, for guaranteeing that the system recovers to a consistent 
state upon a failure. Combined with standard concurrency issues, programming 
on such machines is highly challenging. 

To tackle the complexity and make NVM widely applicable, one would natu- 
rally want to draw on libraries encapsulating highly optimized concurrent crash- 
resilient data structures (a.k.a. persistent objects). This approach goes both 
ways: programmers should be able to reason about their code using abstract 
library specifications that hide the implementation details, and in turn, library 
developers should be able to verify “once and for all” their implementations 
against their specifications abstracting away from a particular client program. 
From a formal standpoint, this indispensable modularity requires us to have a 
so-called (library) abstraction theorem: a correctness condition that guarantees 
the soundness of client reasoning that assumes the specification instead of the 
implementation. Put differently, the abstraction theorem should allow one to es- 
tablish contextual refinement, i.e., conclude that the specification reproduces the 
implementation’s client-observable behaviors under any (valid) context. To the 
best of our knowledge, while several correctness criteria for persistent objects, 
akin to classical linearizability, have been proposed and have been established for 
multiple sophisticated implementations, none of them has been formally related 
to contextual refinement by an abstraction theorem of this kind. 

In this paper we formulate and prove an abstraction theorem for concurrent 
programs utilizing non-volatile memory. We target the “Persistent Sequential 
Consistency” model of [25], or PSC, which enriches the standard sequentially con- 
sistent shared-memory with non-volatile storage using per-location FIFO buffers 
to account for delayed and out-of-order persistence of writes. PSC constitutes a 
relatively simple model that is very close to developers’ informal understanding
of NVM. While existing hardware does not implement PSC as is, [25] presented 
compiler mappings from PSC to x86 (based on its persistency model from [30]), 
which can be used to ensure PSC semantics on Intel machines. Directly support- 
ing relaxed memory models is left for future work. 


Auxiliary material. An extended version, including proofs of theorems stated 
in the paper, is available at https://arxiv.org/abs/2111.03881. 


2 Key Challenges and Ideas 


We outline the main challenges and the key ideas in our solutions. We keep the 
discussion informal, leaving the formal development to later sections. 


2.1 Library Specifications 


The choice of a formalism for specifying library behaviors is integral to stating a
library abstraction theorem. For libraries of concurrent data structures (a.k.a.
concurrent objects), a popular approach is to give specifications in terms of
sequential objects with the help of the classical notion of linearizability [21], which
requires every sequence of method calls and returns that is possible to produce
in a concurrent program to correspond to a sequence that can be generated by 
the sequential object. In this approach, a sequential object, represented by a 
set of sequences of pairs of method invocations and their associated responses, 
constitutes the library specification. Then, abstraction allows the client to rea- 
son about calls to a concurrent library as if they execute atomically on a single 
thread, or, equivalently, protected by a global lock [7,13]. 


For libraries of crash-resilient objects, there is more than one natural way of 
interpreting sequential specifications and adapting the linearizability definition, 
and no single notion of correctness w.r.t. sequential specifications captures all 
different options. A crash-resilient object may ensure that all methods completed 
by the moment of crash survive through it, or that some prefix of them does. It 
may also choose different possibilities for methods in progress at the moment of 
crash (whether they are allowed to take effect at some later point after the
crash or not). Multiple adaptations of linearizability have been proposed, each 
relating crash-resilient objects to sequential specifications in a different way. These
include strict linearizability [3], persistent atomicity [19], and durable linearizability
and its buffered variant [24]. Among them, buffered durable linearizability,
which allows for efficient implementations, ended up not being compositional, 
which means that it may happen that two (non-interacting) libraries are both 
correct, but their combination is not. In fact, since each of the different notions 
is useful for particular objects, one may naturally want to mix different correct- 
ness notions in a single client program. This would force the client to reason 
with several alternatives for interpreting sequential specifications, and to make 
sure that they compose well with one another. 


To handle this variety, we believe it is necessary to follow a different approach,
which is standard in concurrent program verification (see, e.g., [18, 20,
26]), and was applied before for deriving abstraction theorems in different con- 
texts [8, 16,17]. The idea is to take a library’s specification to be just another 
library, where the latter is intended to have a simpler implementation. Then, 
we define a library correctness condition stating what it means for one library 
L to refine another library L* (equivalently, for L* to abstract L), and prove an 
abstraction theorem that ensures that when the library correctness condition is 
met, the behaviors of any client using L are contained in the behaviors of the 
client using L*. Such a theorem is only useful if the correctness condition avoids 
quantification over all possible clients, which would make the theorem trivial. 


Using code for specifying libraries has several advantages over correctness 
notions based on sequential specifications. First, specifications and implementa- 
tions are expressed and reasoned about in a unified framework, alleviating the 
need to interpret the use of sequential specifications by concurrent programs with 
system failures. Instead, the client of the theorem replaces complex library code 
with simpler specification code, and thus works with the semantics of a single 
language. Second, it enables a layered verification technique for library devel- 
opers, allowing them to prove library correctness by introducing one or more 
intermediate implementations between L and L*. Finally, this formulation of 
the abstraction theorem is compositional (a.k.a. local) by construction, meaning
that objects can be specified and verified in isolation. 

Now, “code as a specification” is only useful if the programming language is 
sufficiently expressive for desirable specifications. For concurrent objects, “atomic 
blocks”, often included in theoretic programming languages, provide a handy 
specification construct. For NVM, one needs a way to govern the persistence 
similarly, offering intuitive specifications for libraries that simplify client reason- 
ing. For that matter, viewing the out-of-order persistence of writes to different 
cache lines as the major source of counterintuitive behaviors in NVM, we propose 
a new specification construct, which we call persistence blocks. Roughly speak- 
ing, such blocks may only persist in their entirety, so that persistence blocks 
ensure “all-or-nothing” persistence behavior for the writes they protect.

For example, when recovering after a crash during a run of the tiny program
$\dot{x} := 1;\ \dot{y} := 1$,¹ due to out-of-order persistence (writes to different cache lines
are not guaranteed to persist in the order in which they were issued), we may
reach any combination of values satisfying $\dot{x} \in \{0,1\} \land \dot{y} \in \{0,1\}$. In turn, if a
persistence block is used, as in $\mathtt{beginPB}(\dot{x}, \dot{y});\ \dot{x} := 1;\ \dot{y} := 1;\ \mathtt{endPB}(\dot{x}, \dot{y})$, then
only $\dot{x} = \dot{y} = 0 \lor \dot{x} = \dot{y} = 1$ are possible upon recovery.
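The two outcome sets in this example can be enumerated mechanically. The Python sketch below is a deliberately simplified model, not the PSC semantics of §4.2: it assumes that every variable starts at 0, is written at most once, and occupies its own cache line, so that each write persists independently unless a persistence block groups it with others. The function name `recoverable_states` and the `blocks` parameter are hypothetical.

```python
from itertools import product


def recoverable_states(writes, blocks=()):
    """Enumerate the NVM contents that may be observed after a crash.

    `writes` maps each variable to the single value written to it;
    `blocks` lists groups of variables protected by one persistence block.
    Writes outside any block persist independently of each other; writes
    inside a block persist all together or not at all.
    """
    names = sorted(writes)
    groups = [tuple(b) for b in blocks]
    grouped = {x for g in groups for x in g}
    groups += [(x,) for x in names if x not in grouped]

    states = set()
    for persisted in product([False, True], repeat=len(groups)):
        state = {x: 0 for x in names}  # all variables are initialized to 0
        for group, done in zip(groups, persisted):
            if done:
                for x in group:
                    state[x] = writes[x]
        states.add(tuple(state[x] for x in names))
    return states


# x := 1; y := 1 without and with a persistence block around both writes.
print(sorted(recoverable_states({"x": 1, "y": 1})))
print(sorted(recoverable_states({"x": 1, "y": 1}, blocks=[("x", "y")])))
```

Without the block, all four combinations of x and y in {0, 1} are reported; with the block, only the states where x and y agree remain.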

Our blocks are closely related to persistent transactions of the PMDK li- 
brary [22] (but we avoid the term transaction, since persistence blocks do not 
ensure isolation when executed concurrently). In our technical development, we 
extend the PSC model with instructions for persistence blocks, and carefully con- 
struct their semantics (see §4.2) to allow the abstraction result. We believe that 
persistence blocks are a useful specification construct for various data structures, 
where data consistency naturally involves multiple locations (often, pointers) be- 
ing in-sync with one another. 


2.2 Client-Library Interaction Using Explicit Persist Instructions 


The key to establishing a library abstraction theorem is in decomposing a pro- 
gram into two interacting sub-parts, a client and a library, and understanding 
the interactions between them. These interactions are usually defined in terms 
of histories, taken to be sequences of method invocations and responses, along 
with the values being passed. The library correctness condition (the premise of 
the abstraction theorem) requires that histories produced by using a library L 
are also produced by its specification L* when both libraries are used by a cer- 
tain “most general client” (MGC, for short) that concurrently invokes arbitrary 
methods of L an arbitrary number of times with every possible argument. The 
abstraction theorem ensures that if the library correctness condition holds, then 
L refines L* for any client. 

Thus, for the abstraction theorem to hold, one has to make sure that the 
interactions between any client and the library are fully captured in the his- 
tory produced by the library when used by the MGC. In crash-free sequentially 


1 We use “overdots” to denote non-volatile variables. We assume that all variables are
initialized to 0 and that $\dot{x}$ and $\dot{y}$ lie on different cache lines.
consistent shared memory semantics, this is ensured by the standard assumption
that the client and the library manipulate disjoint sets of memory locations.
Indeed, this restriction guarantees that clients can communicate with libraries 
only via values passed to and returned from method invocations. 


However, we observe that under NVM, mutual interactions between the client 
and the library go beyond passed values, even when assuming disjointness of 
memory locations, which makes the standard notion of a library history in- 
sufficient. As a simple example, consider an interface with just one method 
$f$, specified by $L^* = [f \mapsto \mathtt{sfence};\ \mathtt{return}]$. The sfence instruction, called
“store fence”, is an explicit persist instruction meant to be used in conjunc- 
tion with optimized barriers called “flush-optimal” (denoted by fo). Its role is 
to guarantee the persistence of previous write instructions that are guarded by 
flush-optimal instructions. Concretely, under PSC (following x86), after a thread 
executes $\dot{x} := 1;\ \mathtt{fo}(\dot{x});\ \mathtt{sfence}$, we know that the write of 1 to $\dot{x}$ has persisted
(i.e., been propagated to the NVM), while without the sfence, it may still sit 
in the volatile part of the memory system. 


In turn, consider an implementation $L$, given by $L = [f \mapsto \mathtt{return}]$, that
implements $f$ by doing nothing. Clearly, $L$ does not implement $L^*$ correctly.
Indeed, for the (sequential) client program $\dot{x} := 1;\ \mathtt{fo}(\dot{x});\ \mathtt{call}(f);\ \dot{y} := 1$ that
uses $L^*$, we have $\dot{y} = 1 \implies \dot{x} = 1$ as a global invariant: if the system has
crashed and we have $\dot{y} = 1$ in the NVM, then the sfence ensures that $\dot{x} = 1$
is in the NVM as well. Nevertheless, due to out-of-order persistence, if we use
$L$ in this program, we may get $\dot{y} = 1 \land \dot{x} = 0$ after a crash. Now, the client
and the libraries above mention disjoint locations, and the histories that L may 
produce for the MGC are exactly the histories that L* produces (all well-formed 
sequences of “call” and “return”). Thus, when inspecting histories of L and of 
L*, we do not have sufficient information to observe the difference between them. 
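The difference between the two libraries shows up only in the post-crash memory, which a small state-space exploration makes visible. The following Python sketch uses a simplified, assumed model rather than the formal PSC semantics: each location has a FIFO buffer of unpersisted writes, any buffer head may persist at any time, fo records how many buffered writes it covers, and sfence blocks until all covered writes have persisted. The names `crash_states` and `client` and the tuple encoding of instructions are hypothetical.

```python
LOCS = ("x", "y")  # two non-volatile locations on different cache lines


def _set(tup, i, value):
    """Return a copy of `tup` with position i replaced by `value`."""
    return tup[:i] + (value,) + tup[i + 1:]


def crash_states(program):
    """Explore all interleavings of program steps and persist steps.

    A state is (pc, nvm, buffers, pending): buffers[i] is the FIFO of
    writes to LOCS[i] that have not persisted yet, and pending[i] counts
    the buffered writes covered by an earlier fo on LOCS[i]. The NVM
    contents of every visited state are a possible post-crash outcome.
    """
    init = (0, (0, 0), ((), ()), (0, 0))
    seen, todo, outcomes = {init}, [init], set()
    while todo:
        pc, nvm, bufs, pend = todo.pop()
        outcomes.add(nvm)
        succs = []
        # Persist step: the oldest buffered write of a location reaches the NVM.
        for i in range(len(LOCS)):
            if bufs[i]:
                succs.append((pc,
                              _set(nvm, i, bufs[i][0]),
                              _set(bufs, i, bufs[i][1:]),
                              _set(pend, i, max(pend[i] - 1, 0))))
        # Program step.
        if pc < len(program):
            op = program[pc]
            if op[0] == "write":
                i = LOCS.index(op[1])
                succs.append((pc + 1, nvm, _set(bufs, i, bufs[i] + (op[2],)), pend))
            elif op[0] == "fo":
                i = LOCS.index(op[1])
                succs.append((pc + 1, nvm, bufs, _set(pend, i, len(bufs[i]))))
            elif op[0] == "sfence":
                # Assumption: sfence waits until all fo-covered writes persisted.
                if all(p == 0 for p in pend):
                    succs.append((pc + 1, nvm, bufs, pend))
            else:  # "skip": a method body that does nothing
                succs.append((pc + 1, nvm, bufs, pend))
        for s in succs:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return outcomes


def client(body_of_f):
    """x := 1; fo(x); call(f); y := 1, with the call inlined."""
    return [("write", "x", 1), ("fo", "x"), body_of_f, ("write", "y", 1)]


with_spec = crash_states(client(("sfence",)))  # f = sfence; return
with_impl = crash_states(client(("skip",)))    # f = return
# NVM states are pairs (x, y); (0, 1) violates "y = 1 implies x = 1".
print((0, 1) in with_spec, (0, 1) in with_impl)
```

Under these assumptions the state (0, 1) is unreachable with the specification and reachable with the do-nothing implementation, although both produce the same call/return histories.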


Generally speaking, the challenge stems from the fact that certain explicit 
persist instructions (sfence and other instructions whose implementation in the 
hardware contains an implicit store fence, such as RMWs in x86), which can be 
executed by the library, impose conditions on the persistence of writes performed 
by the client that ran earlier on the same processor. 


We address this challenge in two ways. First, we can sidestep the problem 
by weakening the semantics of store fences, making them relative to a set of 
locations (those used by the library or those used by the client). To do so, we ex- 
tend the programming language with a specification construct similar to a store 
fence, but only affecting a given set of locations, and we restrict its use by each 
component to mention only the locations it owns. The use of these localized 
instructions instead of store fences is sufficient to ensure that the interaction 
between client and library is fully captured in histories, and allows us to estab- 
lish the expected abstraction theorem. Libraries that do not intend to provide 
a store fence functionality to their clients can readily replace store fences with 
their localized counterparts. Doing so gives more freedom to alternative imple- 
mentations of the same specification, which may, e.g., use alternative persist 
instructions without the store fence functionality (such as CLFLUSH in [23]). 
On the other hand, it is possible that in performance-critical systems, clients
would like to rely on a store fence that is executed anyway by the library for the 
library’s own needs. For that, the library developer needs to use a standard store 
fence in the library’s specification rather than the localized counterpart, and the 
abstraction theorem has to handle store fences with their standard, non-localized 
semantics. To do so, we expose in histories not only method invocations and 
responses, but also store fences. Roughly speaking, it means that in addition to 
the standard requirement on values passed by method invocations and responses, 
for L to refine L*, we would also require that L performs a store fence whenever 
L* does (which does not hold for the example above). Our notion of history in §5 
is set to allow store fences (along with their weaker localized versions), and
the abstraction theorem in §6 shows that these extended histories are expressive 
enough for defining the library-correctness condition. 
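To make the role of store fences in histories concrete, here is a small Python sketch comparing the two libraries of the running example, L* = [f ↦ sfence; return] and L = [f ↦ return]. The tuple encoding of events and the helper `standard_history` are assumptions for illustration; the formal notion of history is the one given in §5.

```python
def standard_history(events):
    """Keep only call/return events (the classical notion of a history)."""
    return [e for e in events if e[0] in ("CALL", "RET")]


# Hypothetical event sequences for one invocation of f by the most general client.
spec_events = [("CALL", "f"), ("SF",), ("RET", "f")]  # L* = [f -> sfence; return]
impl_events = [("CALL", "f"), ("RET", "f")]           # L  = [f -> return]

# Classical histories cannot tell the two libraries apart ...
print(standard_history(spec_events) == standard_history(impl_events))
# ... whereas histories that expose store fences can.
print(spec_events == impl_events)
```

The first comparison succeeds and the second fails, which matches the requirement that L must perform a store fence whenever L* does.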


2.3 Handling Calling Policies 


The third challenge we address concerns abstraction for libraries that enforce 
certain calling policies on their clients.² For instance, a library implementing a
lock may require that the calls of each thread for acquiring and releasing the 
lock perfectly interleave, and a library implementing a single-producer queue 
may require that only one thread is calling the enqueue method. In the context 
of NVM, libraries often demand that a distinguished recovery method is called 
after every crash before invoking any other method of the library. When the client 
uses the library in a way that violates the calling policy, the library developer 
ensures nothing, and the blame is assigned to the client. 

In the presence of calling policies, the contextual refinement guaranteed by 
the library abstraction theorem, stating that all behaviors of a program Pr[L]
that uses L are also behaviors of the program Pr[L*] that uses L*, is only applicable
for a program Pr that respects the calling policy. An interesting compositionality
question arises: Are we allowed to assume the library’s specification
when checking whether a program adheres to the calling policy (that is, require
that Pr[L*] adheres to the policy), or should this obligation be satisfied for the
library’s implementation (that is, require that Pr[L] adheres to the policy)?

The latter option would limit the applicability of the abstraction theorem 
for client reasoning. Indeed, it may be the case that establishing that Pr[L]
adheres to the policy depends on the implementation L, whereas the abstraction 
theorem should allow reasoning without knowing the implementation at all. On 
the other hand, the former option seems circular, as it uses contextual refinement 
to establish its own precondition. 

In this paper we show that requiring that Pr[L*] adheres to the policy is 
actually sufficient for ensuring contextual refinement. Roughly speaking, our 
proof avoids circular reasoning by inspecting a minimal contextual refinement 
violation, for which we are able to establish policy adherence when using L, given 


2 This challenge is not particular to NVM, but, interestingly, to the best of our knowledge,
it has not been addressed in previous work establishing abstraction theorems.


268 A. Khyzha and O. Lahav 


policy adherence when using L*. To the best of our knowledge, this is a novel 
argument in the context of library abstraction. It is akin to DRF (data-race 
freedom) guarantees in weak memory concurrency, where often programs are 
guaranteed to have strong semantics (usually, sequential consistency) provided 
that certain race-freedom conditions hold in all runs under the strong semantics. 

We note that many libraries’ calling policies are “structural”, namely they
only enforce certain ordering constraints on the clients that do not depend on 
the values returned by the library (in particular, “execute recovery first” is 
a structural policy). In these cases, policy adherence holds even for an
over-approximation $L_{\mathrm{stub}}$ of L that returns arbitrary values. Certainly, however, this
is not always the case. For example, a library L implementing standard list methods,
cons and head, may require that head is only called on non-empty lists (like,
e.g., pop_front in C++ that triggers undefined behavior if applied to an empty 
list [1]). Then, invoking head with the value returned from cons does adhere 
to the calling policy, but this is not the case for the over-approximated library 
$L_{\mathrm{stub}}$, which allows cons to return the empty list.
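The cons/head example can be played through in a few lines of Python. This is only a sketch of the idea: the tuple representation of lists, the helper names, and the particular values returned by the stub are assumptions, chosen to show why a value-dependent policy cannot be checked against an over-approximating library.

```python
EMPTY = ()


def cons(value, lst):
    """Actual library method: prepend `value` to `lst`."""
    return (value,) + lst


def head_call_allowed(lst):
    """Calling policy: head may only be invoked on a non-empty list."""
    return lst != EMPTY


def client_adheres(possible_cons_results):
    """The client calls head on whatever cons returned, so it adheres to the
    policy only if every possible result is a legal argument for head."""
    return all(head_call_allowed(r) for r in possible_cons_results)


# With the actual library, cons(1, EMPTY) has exactly one possible result.
actual_results = [cons(1, EMPTY)]
# Hypothetical stub library: cons may return an arbitrary list, including EMPTY.
stub_results = [EMPTY, (1,), (7, 1)]

print(client_adheres(actual_results))
print(client_adheres(stub_results))
```

The client adheres to the policy with the actual library but not with the stub, because the stub may hand back the empty list.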


3 NVM Programs: Syntax and Semantics 


In this section we begin to present the formal settings for our results. As standard 
in memory models, it is convenient to break the operational semantics into: 
a program semantics (a.k.a. thread subsystem) and a memory semantics. We 
represent both components as labeled transition systems whose transition labels 
correspond to the operations they perform. We then consider the synchronized 
runs of the program and the memory, where program actions that interact with 
the memory are matched by actions executed by the memory system (see §4.1). 

Next, we focus on the program part of the semantics, presenting both syntax 
(§3.1) and semantics (§3.2). We use the following standard notations. 


Notation for finite sequences. For a finite alphabet $\Sigma$, we denote by $\Sigma^*$
(respectively, $\Sigma^+$) the set of all (non-empty) sequences over $\Sigma$. We use $\varepsilon$ to
denote the empty sequence. The length of a sequence $s$ is denoted by $|s|$. We often
identify sequences with their underlying functions (whose domain is $\{1, \ldots, |s|\}$),
and write $s(k)$ for the symbol at position $1 \le k \le |s|$ in $s$. We write $\sigma \in s$
if $\sigma$ appears in $s$, that is if $s(k) = \sigma$ for some $1 \le k \le |s|$. We use “$\cdot$” for
concatenating sequences, and identify symbols with sequences of length 1.


3.1 Program Syntax 


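The domains and metavariables used to range over them are as follows:
  values                          $v, u \in \mathsf{Val} = \{0, 1, 2, \ldots\}$
  shared non-volatile variables   $\dot{x}, \dot{y} \in \mathsf{NVVar} = \{\dot{x}, \dot{y}, \ldots\}$
  shared volatile variables       $\tilde{x}, \tilde{y} \in \mathsf{VVar} = \{\tilde{x}, \tilde{y}, \ldots\}$
  shared variables                $x, y \in \mathsf{Var} = \mathsf{NVVar} \cup \mathsf{VVar}$
  register names                  $r \in \mathsf{Reg} = \{a, b, \ldots\}$
  thread identifiers              $\tau, \pi \in \mathsf{Tid} = \{T_1, T_2, \ldots, T_N\}$
  method names                    $f \in F$ ($\mathsf{main} \notin F$)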
Thus, there are three kinds of variables: shared non-volatile, shared volatile, and
thread-local ones (called registers), which are also volatile. A distinguished name 
main is reserved for the starting point of the program execution. 

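For concreteness, we present a simple programming-language syntax. Its expressions
and instructions are given by the following grammar:³
  $e ::= r \;\mid\; v \;\mid\; e + e \;\mid\; e = e \;\mid\; e \ne e \;\mid\; \ldots$
  $inst ::= r := e \;\mid\; \mathtt{if}\ e\ \mathtt{goto}\ n_1 | \ldots | n_m \;\mid\; \mathtt{havoc} \;\mid\; x := e \;\mid\; r := x$
       $\mid\; \mathtt{fl}(\dot{x}) \;\mid\; \mathtt{fo}(\dot{x}) \;\mid\; \mathtt{sfence} \;\mid\; \mathtt{call}(f) \;\mid\; \mathtt{return}$
       $\mid\; \mathtt{lsfence}(X) \;\mid\; \mathtt{beginPB}(X) \;\mid\; \mathtt{endPB}(X)$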


Expressions are constructed with arithmetic and boolean operations over
registers and values. Instructions consist of a local assignment $r := e$; a conditional
$\mathtt{if}\ e\ \mathtt{goto}\ n_1 | \ldots | n_m$ for non-deterministically jumping to a program
counter from $\{n_1, \ldots, n_m\}$ when $e$ evaluates to non-zero or, otherwise, skipping
($\mathtt{goto}\ n_1 | \ldots | n_m$ can be encoded as $\mathtt{if}\ 1\ \mathtt{goto}\ n_1 | \ldots | n_m$); havoc for arbitrarily
modifying all registers; a write to memory $x := e$; and a read from memory
$r := x$. There are also explicit persist instructions: a flush instruction $\mathtt{fl}(\dot{x})$ and
its optimized version $\mathtt{fo}(\dot{x})$, called flush-optimal (referred to as CLFLUSH and
CLFLUSHOPT in [23]), as well as the store fence instruction sfence (see §2.2).

This standard instruction set is extended to support calling and specifying 
library methods. There is a call instruction call(f) and a return instruction 
return. The novel specification constructs include the local store fence instruction
$\mathtt{lsfence}(X)$ that relaxes the semantics of sfence by only enforcing the
persistence ordering for the given set $X$ of variables (thus, $\mathtt{lsfence}(\mathsf{NVVar})$ is
equivalent to sfence); and instructions to begin and end a persistence block, 
beginPB(X) and endPB(X), respectively. The persistence block demarks the 
writes that need to persist simultaneously after the block ends, either non- 
deterministically or triggered by a flush on some variable in X. 

Next, we employ three syntactic categories: 

• Instruction sequences represent the (sequential) implementation of each method
(including main). Formally, an instruction sequence $I$ is a function from a non-empty
finite domain of the form $\{0, \ldots, n\}$ (representing the possible program
counters) to the set of instructions. We say that an instruction sequence is
flat if it does not include an instruction of the form $\mathtt{call}(\_)$.

• Sequential programs consist of a “main” method accompanied by implementations
of every method $f \in F$. Formally, a sequential program $S$ is a
function assigning an instruction sequence to every $f \in \{\mathsf{main}\} \cup F$. To avoid
modeling a call stack and simplify the presentation, we require that $S(f)$ is a
flat instruction sequence for every $f \in F$.

• Concurrent programs are top-level parallel compositions of sequential programs,
all accompanied by the same method implementations. Formally, a
(concurrent) program $Pr$ is a mapping assigning a sequential program to every
$\tau \in \mathsf{Tid}$, with $Pr(\tau)(f) = Pr(\pi)(f)$ for every $\tau, \pi \in \mathsf{Tid}$ and $f \in F$. Below,
we write $Pr(f)$ for $Pr(T_1)(f)$.


3 In the extended version of this paper, we also include read-modify-write instructions. 


270 A. Khyzha and O. Lahav 


3.2 Program Semantics 
We give semantics to the syntactic objects using labeled transition systems. 


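Definition 1. A labeled transition system (LTS) is a tuple 
$A = \langle \Sigma, Q, q_{\mathsf{init}}, T \rangle$, where $\Sigma$ is a set of transition labels, 
$Q$ is a set of states, $q_{\mathsf{init}} \in Q$ is the initial state, and 
$T \subseteq Q \times \Sigma \times Q$ is a set of transitions. We often write 
$q \xrightarrow{\sigma} q'$ to denote a transition $\langle q, \sigma, q' \rangle$. We denote by 
$A.\Sigma$, $A.Q$, $A.q_{\mathsf{init}}$, and $A.T$ the components of an LTS $A$. We write 
$\xrightarrow{\sigma}_A$ for the relation $\{\langle q, q' \rangle \mid q \xrightarrow{\sigma} q' \in A.T\}$ 
and $\rightarrow_A$ for $\bigcup_{\sigma \in A.\Sigma} \xrightarrow{\sigma}_A$. For a sequence 
$t \in A.\Sigma^*$, we write $\xrightarrow{t}_A$ for the composition 
$\xrightarrow{t(1)}_A ; \ldots ; \xrightarrow{t(|t|)}_A$. A sequence $t \in A.\Sigma^*$ such that 
$A.q_{\mathsf{init}} \xrightarrow{t}_A q$ for some $q \in A.Q$ is called a trace of $A$. We denote 
by $\mathsf{traces}(A)$ the set of all traces of $A$. A state $q \in A.Q$ is called reachable 
in $A$ if $A.q_{\mathsf{init}} \xrightarrow{t}_A q$ for some $t \in \mathsf{traces}(A)$. 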
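
As an illustration of Definition 1, the following is a minimal Python sketch of a finite LTS with trace checking and reachability. The class name `LTS`, its method names, and the three-state example `toy` are assumptions made for this sketch only; the LTSs considered in the paper are generally infinite, so this only conveys the shape of the definitions.

```python
from collections import deque


class LTS:
    """Finite LTS <labels, states, q_init, transitions> (cf. Definition 1)."""

    def __init__(self, labels, states, q_init, transitions):
        self.labels = set(labels)
        self.states = set(states)
        self.q_init = q_init
        # Each transition is a triple (q, sigma, q_next).
        self.transitions = set(transitions)

    def post(self, q, sigma):
        """All states q_next with q --sigma--> q_next."""
        return {q2 for (q1, s, q2) in self.transitions if q1 == q and s == sigma}

    def run(self, t):
        """States reachable from q_init by following the label sequence t."""
        current = {self.q_init}
        for sigma in t:
            current = {q2 for q in current for q2 in self.post(q, sigma)}
        return current

    def is_trace(self, t):
        """t is a trace iff it leads from q_init to at least one state."""
        return len(self.run(t)) > 0

    def reachable(self):
        """Breadth-first search for all states reachable via some trace."""
        seen, todo = {self.q_init}, deque([self.q_init])
        while todo:
            q = todo.popleft()
            for (q1, _, q2) in self.transitions:
                if q1 == q and q2 not in seen:
                    seen.add(q2)
                    todo.append(q2)
        return seen


# Hypothetical three-state example; state 2 is not reachable from state 0.
toy = LTS({"a", "b"}, {0, 1, 2}, 0, {(0, "a", 1), (1, "b", 0), (2, "a", 2)})
```

Here `toy.is_trace(["a", "b", "a"])` holds, while `["b"]` is not a trace because no `b`-transition leaves the initial state.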


Next, we define the LTSs induced by instruction sequences, sequential pro- 
grams, and concurrent programs. We will often identify the syntactic objects 
with the LTS they induce (e.g., when writing expressions like S.Q for a sequen- 
tial program S). The transition labels of these LTSs feature action labels. 


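Definition 2. An action label takes one of the following forms: a read $\mathsf{R}(x, v)$, a 
write $\mathsf{W}(x, v)$, a flush $\mathsf{FL}(\dot{x})$, a flush-opt $\mathsf{FO}(\dot{x})$, an sfence $\mathsf{SF}$, 
a local sfence $\mathsf{LSF}(X)$, a start $\mathsf{beginPB}(X)$ or an end $\mathsf{endPB}(X)$ of a 
persistence block, a call $\mathsf{CALL}(f, \phi)$, or a return $\mathsf{RET}(f, \phi)$, where 
$x \in \mathsf{Var}$, $v \in \mathsf{Val}$, $\dot{x} \in \mathsf{NVVar}$, $X \subseteq \mathsf{NVVar}$, 
$f \in \mathsf{F}$, and $\phi : \mathsf{Reg} \to \mathsf{Val}$. We denote by $\mathsf{Lab}$ the set of all 
action labels. The functions typ and var retrieve (when applicable) the type 
(R/W/...) and variable ($x$ or $\dot{x}$) of an action label. We write $\mathsf{varset}(l)$ 
for the set of variables mentioned in $l$ (e.g., $\mathsf{varset}(\mathsf{R}(x, v)) = \{x\}$, 
$\mathsf{varset}(\mathsf{LSF}(X)) = X$, and $\mathsf{varset}(\mathsf{SF}) = \emptyset$). 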




Action labels represent the interactions that a program has with the memory. 


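Definition 3. The LTS induced by an instruction sequence $I$ is given by: 

• The transition labels are action labels, extended with $\varepsilon$ for silent transitions. 

• The states are pairs $\langle pc, \phi \rangle$ where $pc \in \mathbb{N}$, called program counter, 
stores the current instruction pointer inside the sequence, and 
$\phi : \mathsf{Reg} \to \mathsf{Val}$, called local store, records the values of the registers. 
We assume that local stores are extended to expressions in the obvious way. 

• The initial state is $\langle 0, \phi_{\mathsf{init}} \rangle$, where $\phi_{\mathsf{init}} = \lambda r.\, 0$. 

• The transitions are as follows: 

\[ \dfrac{I(pc) = r := e \quad \phi' = \phi[r \mapsto \phi(e)]}{\langle pc, \phi \rangle \xrightarrow{\varepsilon}_I \langle pc + 1, \phi' \rangle} \]

\[ \dfrac{I(pc) = \mathtt{if}\ e\ \mathtt{goto}\ n_1 \mid \ldots \mid n_m \quad \phi(e) \neq 0 \implies pc' \in \{n_1, \ldots, n_m\} \quad \phi(e) = 0 \implies pc' = pc + 1}{\langle pc, \phi \rangle \xrightarrow{\varepsilon}_I \langle pc', \phi \rangle} \]

\[ \dfrac{I(pc) = \mathtt{havoc}}{\langle pc, \phi \rangle \xrightarrow{\varepsilon}_I \langle pc + 1, \phi' \rangle} \]

\[ \dfrac{I(pc) = x := e \quad l = \mathsf{W}(x, \phi(e))}{\langle pc, \phi \rangle \xrightarrow{l}_I \langle pc + 1, \phi \rangle} \]

\[ \dfrac{I(pc) = r := x \quad l = \mathsf{R}(x, v) \quad \phi' = \phi[r \mapsto v]}{\langle pc, \phi \rangle \xrightarrow{l}_I \langle pc + 1, \phi' \rangle} \]

\[ \dfrac{I(pc) \in \{\mathtt{fl}(\_), \mathtt{fo}(\_), \mathtt{sfence}, \mathtt{lsfence}(\_), \mathtt{beginPB}(\_), \mathtt{endPB}(\_)\} \quad l = \mathsf{matching\text{-}label}(I(pc))}{\langle pc, \phi \rangle \xrightarrow{l}_I \langle pc + 1, \phi \rangle} \]

Recall that program semantics is separate from memory semantics, which 
is why the transitions above completely ignore the restrictions arising from the 
memory system. In particular, the write to memory x := e only announces 
itself in the label. The read from memory r := x loads an arbitrary value v 
into the destination register r, announcing that value in the read label. Other 
instructions act as no-ops, and simply announce themselves in the transition 
label, using the function matching-label that maps each instruction to its label 
($\mathtt{fl}(\dot{x}) \mapsto \mathsf{FL}(\dot{x})$, $\mathtt{fo}(\dot{x}) \mapsto \mathsf{FO}(\dot{x})$, and so on). 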
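
The instruction-sequence transitions can be sketched as a successor function in Python. This is an illustrative sketch under several assumptions that are not part of the paper: instructions are encoded as tuples, expressions as Python functions on the store, `values` is a finite stand-in for Val so that the read rule can be enumerated, and havoc, lsfence, beginPB and endPB are left out for brevity.

```python
EPS = "eps"  # label of silent transitions


def steps(seq, pc, store, values=(0, 1)):
    """Successors (label, pc', store') of state <pc, store> in sequence `seq`.

    seq    : dict from program counters to instruction tuples
    store  : dict from register names to values
    values : finite stand-in for Val, used to enumerate read results
    Expressions are modeled as Python functions from stores to values.
    """
    if pc not in seq:
        return []
    ins = seq[pc]
    kind = ins[0]
    if kind == "assign":            # r := e
        _, r, e = ins
        return [(EPS, pc + 1, {**store, r: e(store)})]
    if kind == "ifgoto":            # if e goto n1 | ... | nm
        _, e, targets = ins
        if e(store) != 0:
            return [(EPS, n, store) for n in targets]
        return [(EPS, pc + 1, store)]
    if kind == "write":             # x := e (only announces itself)
        _, x, e = ins
        return [(("W", x, e(store)), pc + 1, store)]
    if kind == "read":              # r := x (any value may be loaded)
        _, r, x = ins
        return [(("R", x, v), pc + 1, {**store, r: v}) for v in values]
    if kind in ("fl", "fo"):        # no-ops that announce themselves
        return [((kind.upper(), ins[1]), pc + 1, store)]
    if kind == "sfence":
        return [(("SF",), pc + 1, store)]
    raise ValueError(f"instruction not covered by this sketch: {ins!r}")


# Hypothetical program: x := 1; r := x; if r goto 0 | 3
prog = {
    0: ("write", "x", lambda s: 1),
    1: ("read", "r", "x"),
    2: ("ifgoto", lambda s: s.get("r", 0), (0, 3)),
}
```

Note that the read at program counter 1 has one successor per candidate value: at this level nothing constrains which value is read, and it is the memory system that later rules out the impossible ones.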

Finally, call(f) and return instructions are not handled at this level, but 
receive special semantics at the level of sequential programs, as defined next. 


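Definition 4. The LTS induced by a sequential program $S$ is given by: 

• The transition labels are action labels, extended with $\varepsilon$ for silent transitions. 

• The states are tuples $q = \langle pc, \phi, pc_s, f \rangle$, where: 

  – $\langle pc, \phi \rangle$ is a state of the instruction sequence (see Def. 3) storing the 
    state of the sequence currently running. 

  – $pc_s \in \mathbb{N} \cup \{\bot\}$, called the stored program counter, is used to remember 
    the program position to jump to when the current instruction sequence returns, 
    whereas $pc_s = \bot$ means that the main method is currently running. (Recall 
    that we assume that $S(f)$ is flat for every $f \in \mathsf{F}$, so we do not need to 
    record the call stack.) 

  – $f \in \mathsf{F} \cup \{\mathsf{main}\}$, called the active method, tracks the method that is 
    currently running. 

  We denote by $q.pc$, $q.\phi$, $q.pc_s$, and $q.f$ the components of a state $q \in S.Q$. 

• The initial state is $\langle 0, \phi_{\mathsf{init}}, \bot, \mathsf{main} \rangle$. 

• The transitions are given by: 

NORMAL
\[ \dfrac{l_\varepsilon \in \mathsf{Lab} \cup \{\varepsilon\} \quad f \in \{\mathsf{main}\} \cup \mathsf{F} \quad \langle pc, \phi \rangle \xrightarrow{l_\varepsilon}_{S(f)} \langle pc', \phi' \rangle}{\langle pc, \phi, pc_s, f \rangle \xrightarrow{l_\varepsilon}_S \langle pc', \phi', pc_s, f \rangle} \]

CALL
\[ \dfrac{S(\mathsf{main})(pc) = \mathtt{call}(f) \quad l = \mathsf{CALL}(f, \phi)}{\langle pc, \phi, \bot, \mathsf{main} \rangle \xrightarrow{l}_S \langle 0, \phi, pc + 1, f \rangle} \]

RETURN
\[ \dfrac{S(f)(pc) = \mathtt{return} \quad l = \mathsf{RET}(f, \phi)}{\langle pc, \phi, pc_s, f \rangle \xrightarrow{l}_S \langle pc_s, \phi, \bot, \mathsf{main} \rangle} \]

NON-DET-SFENCE
\[ \dfrac{l = \mathsf{SF}}{\langle pc, \phi, pc_s, f \rangle \xrightarrow{l}_S \langle pc, \phi, pc_s, f \rangle} \]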


The NORMAL transition lifts the instruction-sequence transition to the level 
of sequential programs. Note that the transition applies for any method (main or 
other). The CALL transition passes control from the main method to some other 
method, jumping the program counter to the first instruction and storing the 
return point (pc+1). The RETURN transition passes control back using the stored 
return point. For simplicity, we do not have any argument passing mechanism 
and use the full register store for that purpose. (If needed, each component may 
store the values it needs in the memory, and reload them later on.) 
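
The call and return discipline described above can be sketched in Python as follows. The encoding is an assumption of this sketch: method bodies are lists of tuples, only the hypothetical instruction forms `("nop",)`, `("call", g)` and `("return",)` are modeled, the store is an opaque immutable value, and the return rule is guarded by `pc_s is not BOT` so that the toy model never resumes at an undefined position.

```python
BOT = None    # "no stored program counter": main is running
EPS = "eps"   # silent label


def seq_steps(S, state):
    """Successors (label, state') of <pc, store, pc_s, f> in program S.

    S maps method names to lists of instructions. The first successor is
    always the non-deterministic sfence, which leaves the state unchanged.
    """
    pc, store, pc_s, f = state
    out = [(("SF",), state)]                 # NON-DET-SFENCE: always enabled
    body = S[f]
    if pc >= len(body):
        return out
    ins = body[pc]
    if ins[0] == "nop":                      # NORMAL: lift the local step
        out.append((EPS, (pc + 1, store, pc_s, f)))
    elif ins[0] == "call" and f == "main":   # CALL: only main may call
        g = ins[1]
        out.append((("CALL", g, store), (0, store, pc + 1, g)))
    elif ins[0] == "return" and pc_s is not BOT:
        # RETURN: resume main at the stored program counter.
        out.append((("RET", f, store), (pc_s, store, BOT, "main")))
    return out


# Hypothetical program: main calls f once; f does one step and returns.
S = {"main": [("call", "f"), ("nop",)], "f": [("nop",), ("return",)]}
init = (0, (), BOT, "main")
```

Because library methods are flat, a single stored program counter is enough and no call stack appears in the state.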

Finally, NON-DET-SFENCE is a non-standard transition that we find technically 
convenient to have. It allows the program to non-deterministically execute 
an sfence at any point. Since sfences only restrict the possible behaviors (as will 
become apparent when we present the memory system), this transition is safe to 
include in the program semantics. It is particularly useful for simplifying the 
library correctness condition, which only considers inclusion of sets of histories 
(see §5). For instance, switching the roles of $L$ and $L^\sharp$ from §2.2, the library 
implementing f using sfence should be considered a refinement of the one that 
simply returns. For that, we allow the no-op specification to perform 
non-deterministic sfences that match the ones executed by the concrete implementation. 

Finally, the LTS induced by a concurrent program is defined as follows. 


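Definition 5. The LTS induced by a (concurrent) program $Pr$ is given by: 

• The set of transition labels is given by $(\mathsf{Tid} \times (\mathsf{Lab} \cup \{\varepsilon\})) \cup \{\lightning\}$. 
The functions on action labels (e.g., typ, var) are lifted to these labels in the 
obvious way. 

• The states, denoted by $\bar{q}$, assign a state in $Pr(\tau).Q$ to every $\tau \in \mathsf{Tid}$. 

• The initial state is composed from the initial state of each thread: 

\[ \bar{q}_{\mathsf{init}} \triangleq \langle Pr(\tau_1).q_{\mathsf{init}}, \ldots, Pr(\tau_N).q_{\mathsf{init}} \rangle \]

• The transitions are interleaved thread transitions or crash transitions 
reinitializing the program state: 

NORMAL
\[ \dfrac{l_\varepsilon \in \mathsf{Lab} \cup \{\varepsilon\} \quad \bar{q}(\tau) \xrightarrow{l_\varepsilon}_{Pr(\tau)} q'}{\bar{q} \xrightarrow{\tau, l_\varepsilon}_{Pr} \bar{q}[\tau \mapsto q']} \]

CRASH
\[ \dfrac{}{\bar{q} \xrightarrow{\lightning}_{Pr} \bar{q}_{\mathsf{init}}} \]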


4 The PSC Memory System 


We present PSC (“Persistent Sequential Consistency”), the persistency model 
used as the memory system. We first introduce the model as it is in [25] (extended 
with standard volatile memory alongside the non-volatile one), following 
its operational presentation as an LTS with non-deterministic memory-internal 
transitions that flush stores from the volatile part to the non-volatile part. In 
§4.1, we define the synchronization of programs with the PSC memory system. 
In §4.2, we present the extensions added in this paper that are useful for library 
abstraction. Finally, in §4.3, we establish certain separation properties of PSC 
that are essential in our proofs. 

Roughly speaking, a state in PSC consists of a non-volatile memory (a mapping 
from non-volatile variables to values) and a volatile memory (a mapping from 
volatile variables to values). The volatile memory works just like a normal 
sequentially consistent memory, keeping track of the latest value written to every 
variable and returning that value for reads. Upon a crash, the contents of the 
volatile memory are reset to their initial state. The non-volatile memory behaves 
observationally the same between crashes, but its contents survive crashes. To 
model delayed and out-of-order persistence of writes, write steps to non-volatile 
variables do not alter the non-volatile memory immediately when issued. Instead, 
writes first go to volatile per-variable persistence FIFO buffers, which maintain 
the writes to each variable that are yet to persist. Then, PSC non-deterministically 
takes persist steps that apply the oldest update from a persistence buffer to the 
non-volatile memory. Reads from non-volatile variables retrieve the latest value 
in the relevant buffer, or the value from the non-volatile memory if that buffer 
holds no pending write, thus providing standard sequentially consistent semantics 
in the absence of system crashes. Upon a crash, the buffers are reset to their 
initial (empty) state, but the contents of the non-volatile memory remain intact. 

Explicit persist instructions can be used to control the persistence of writes. 
A “flush” barrier for a certain variable blocks the execution until the relevant 
persistence buffer is empty, thus forcing all previous writes to that variable to 
persist. Alternatively, a (cheaper) “flush-optimal” barrier for a certain variable 
enqueues a special marker in the persistence buffer of this variable accompanied 
by the thread identifier of the thread that issued the barrier. The effect of flush- 
optimal is delayed until the same thread performs an sfence, which blocks the 
execution until all flush-optimal markers of that thread are dequeued from all 
buffers. The fact that the persistence buffers are FIFO ensures that an sfence by 
some thread forces the persistence of all writes executed before a flush-optimal 
issued by the same thread. 
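
The intuitive account above can be sketched as a small Python class. This is an illustrative model only, assuming string variable names, integer thread identifiers, and an explicit `persist(x)` call standing in for the non-deterministic persist step; the class and method names are hypothetical and not taken from the paper.

```python
class PSC:
    """Toy model of the PSC memory state (no persistence blocks)."""

    def __init__(self, nv_vars, v_vars):
        self.nvm = {x: 0 for x in nv_vars}    # non-volatile memory
        self.vm = {x: 0 for x in v_vars}      # volatile memory
        self.buf = {x: [] for x in nv_vars}   # per-variable FIFO buffers

    def write(self, tid, x, v):
        if x in self.vm:
            self.vm[x] = v                    # volatile write: immediate
        else:
            self.buf[x].append(("W", v))      # non-volatile write: buffered

    def read(self, x):
        if x in self.vm:
            return self.vm[x]
        for kind, val in reversed(self.buf[x]):
            if kind == "W":                   # latest buffered write wins
                return val
        return self.nvm[x]

    def flush_enabled(self, x):
        return not self.buf[x]                # flush blocks until buffer empty

    def flush_opt(self, tid, x):
        self.buf[x].append(("FO", tid))       # enqueue a marker for thread tid

    def sfence_enabled(self, tid):
        return all(("FO", tid) not in b for b in self.buf.values())

    def persist(self, x):
        """Apply the oldest entry of x's buffer (buffer must be non-empty)."""
        kind, val = self.buf[x].pop(0)
        if kind == "W":
            self.nvm[x] = val

    def crash(self):
        for x in self.vm:
            self.vm[x] = 0
        for x in self.buf:
            self.buf[x] = []                  # buffered writes are lost
```

For example, after `write(1, "x", 1); flush_opt(1, "x")`, an sfence by thread 1 is blocked until two persist steps on `x` have drained both the write and the marker, at which point the write is guaranteed to survive a crash.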


Definition 6. PSC is the LTS defined as follows: 

• The transition labels are given by $(\mathsf{Tid} \times \mathsf{Lab}) \cup \{\mathsf{per}, \lightning\}$. That is, 
a transition label can be a pair of the thread identifier and the action label of 
the operation, $\mathsf{per}$ denoting the internal propagation action, or $\lightning$ denoting 
a system crash. 

• The states are tuples $M = \langle \dot{m}, \tilde{m}, P \rangle$, where: 

  – $\dot{m} : \mathsf{NVVar} \to \mathsf{Val}$ is called the non-volatile memory. 

  – $\tilde{m} : \mathsf{VVar} \to \mathsf{Val}$ is called the volatile memory. 

  – $P : \mathsf{NVVar} \to \mathsf{PLBuff}$ is called the persistence buffer. Here, $\mathsf{PLBuff}$ 
    denotes the set of all per-location persistence buffers, each of which is a 
    finite sequence $p$ of entries of the form $\mathsf{W}(v)$ for $v \in \mathsf{Val}$ (writes), or 
    $\mathsf{FO}(\tau)$ for $\tau \in \mathsf{Tid}$ (flush-optimal markers). The persistence buffer $P$ 
    assigns a per-location persistence buffer to every non-volatile variable.⁴ 

  We denote by $M.\dot{m}$, $M.\tilde{m}$, and $M.P$ the components of a state 
  $M \in \mathsf{PSC}.Q$, and write $M[X \mapsto Y]$ for the state obtained from $M$ by 
  setting $M.X$ to $Y$. 

• The initial state is $M_{\mathsf{init}} = \langle \dot{m}_{\mathsf{init}}, \tilde{m}_{\mathsf{init}}, P_{\mathsf{init}} \rangle$, where 
$\dot{m}_{\mathsf{init}} = \lambda \dot{x}.\, 0$, $\tilde{m}_{\mathsf{init}} = \lambda \tilde{x}.\, 0$, and $P_{\mathsf{init}} = \lambda \dot{x}.\, \epsilon$. 

• The transitions of PSC are presented in Fig. 1, using an auxiliary function 
for looking up the most recent value of a variable: we let $M(x)$ be $M.\tilde{m}(x)$ 
for $x \in \mathsf{VVar}$, and, for $x \in \mathsf{NVVar}$, either the value $v$ of the last 
(rightmost) write entry in $M.P(x)$ or, when there is no such entry, $M.\dot{m}(x)$. 


The transitions follow the intuitive account above. Those corresponding to 
program transitions are labeled with pairs in $\mathsf{Tid} \times \mathsf{Lab}$. For instance, a 
transition labeled with $\langle \tau, \mathsf{R}(x, v_R) \rangle$ means that thread $\tau$ reads the value 
$v_R$ from the (volatile or non-volatile) shared variable $x$. 


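4 We conservatively assume that writes persist at the location granularity, rather than 
at the cache-line granularity as happens in real machines. 

V-WRITE
\[ \dfrac{l = \mathsf{W}(\tilde{x}, v) \quad \tilde{m}' = M.\tilde{m}[\tilde{x} \mapsto v]}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M[\tilde{m} \mapsto \tilde{m}']} \]

NV-WRITE
\[ \dfrac{l = \mathsf{W}(\dot{x}, v) \quad p' = M.P(\dot{x}) \cdot \mathsf{W}(v) \quad P' = M.P[\dot{x} \mapsto p']}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M[P \mapsto P']} \]

READ
\[ \dfrac{l = \mathsf{R}(x, v) \quad M(x) = v}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M} \]

FLUSH
\[ \dfrac{l = \mathsf{FL}(\dot{x}) \quad M.P(\dot{x}) = \epsilon}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M} \]

FLUSH-OPT
\[ \dfrac{l = \mathsf{FO}(\dot{x}) \quad p' = M.P(\dot{x}) \cdot \mathsf{FO}(\tau) \quad P' = M.P[\dot{x} \mapsto p']}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M[P \mapsto P']} \]

SFENCE
\[ \dfrac{l = \mathsf{SF} \quad \forall \dot{y}.\ \mathsf{FO}(\tau) \notin M.P(\dot{y})}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M} \]

PERSIST-WRITE
\[ \dfrac{l = \mathsf{per} \quad M.P(\dot{x}) = \mathsf{W}(v) \cdot p \quad P' = M.P[\dot{x} \mapsto p] \quad \dot{m}' = M.\dot{m}[\dot{x} \mapsto v]}{M \xrightarrow{l}_{\mathsf{PSC}} M[\dot{m} \mapsto \dot{m}', P \mapsto P']} \]

PERSIST-FO
\[ \dfrac{l = \mathsf{per} \quad M.P(\dot{x}) = \mathsf{FO}(\tau) \cdot p \quad P' = M.P[\dot{x} \mapsto p]}{M \xrightarrow{l}_{\mathsf{PSC}} M[P \mapsto P']} \]

CRASH
\[ \dfrac{l = \lightning}{M \xrightarrow{l}_{\mathsf{PSC}} M_{\mathsf{init}}[\dot{m} \mapsto M.\dot{m}]} \]

Fig. 1. Transitions of PSC 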


4.1 Linking Programs and Memories 


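To give semantics of programs running under PSC, the thread system is 
synchronized with the PSC memory system. Formally, the synchronization of a program 
$Pr$ with PSC is another LTS, denoted by $Pr \bowtie \mathsf{PSC}$, defined as follows: 

• The set of transition labels is $Pr.\Sigma \cup \mathsf{PSC}.\Sigma$, i.e., 
$(\mathsf{Tid} \times (\mathsf{Lab} \cup \{\varepsilon\})) \cup \{\mathsf{per}, \lightning\}$. 

• The states are pairs $\langle \bar{q}, M \rangle \in Pr.Q \times \mathsf{PSC}.Q$. 

• The initial state is $\langle \bar{q}_{\mathsf{init}}, M_{\mathsf{init}} \rangle$. 

• The transitions are given by: 

SYNCHRONIZED
\[ \dfrac{\alpha \in (\mathsf{Tid} \times \mathsf{Lab}) \cup \{\lightning\} \quad \bar{q} \xrightarrow{\alpha}_{Pr} \bar{q}' \quad M \xrightarrow{\alpha}_{\mathsf{PSC}} M'}{\langle \bar{q}, M \rangle \xrightarrow{\alpha}_{Pr \bowtie \mathsf{PSC}} \langle \bar{q}', M' \rangle} \]

PROGRAM-INTERNAL
\[ \dfrac{\alpha \in \mathsf{Tid} \times \{\varepsilon\} \quad \bar{q} \xrightarrow{\alpha}_{Pr} \bar{q}'}{\langle \bar{q}, M \rangle \xrightarrow{\alpha}_{Pr \bowtie \mathsf{PSC}} \langle \bar{q}', M \rangle} \]

MEMORY-INTERNAL
\[ \dfrac{\alpha = \mathsf{per} \quad M \xrightarrow{\alpha}_{\mathsf{PSC}} M'}{\langle \bar{q}, M \rangle \xrightarrow{\alpha}_{Pr \bowtie \mathsf{PSC}} \langle \bar{q}, M' \rangle} \]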


The above transitions are "synchronized transitions" of Pr and PSC, using the 
labels to decide what to synchronize on. For labels common to both LTSs, the 
program and the memory take the same step together; for labels that belong only 
to the program, only the program steps; and for labels that belong only to the 
memory, only the memory steps. 
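
The three synchronization rules can be sketched as a generic product construction in Python. The sketch assumes that each component is given as a successor function returning `(label, next_state)` pairs, with labels encoded as `(tid, l)` pairs or the strings `"per"` and `"crash"`; the two toy components are hypothetical and exist only to exercise the function.

```python
EPS = "eps"
PER = "per"
CRASH = "crash"


def product_steps(prog_steps, mem_steps, q, M):
    """Successors (label, (q', M')) of <q, M> in the synchronized product.

    Program labels are (tid, l), (tid, EPS) or CRASH; memory labels are
    (tid, l), PER or CRASH.
    """
    mem = mem_steps(M)
    out = []
    for a, q2 in prog_steps(q):
        if isinstance(a, tuple) and a[1] == EPS:
            out.append((a, (q2, M)))                   # PROGRAM-INTERNAL
        else:
            # SYNCHRONIZED: both components must offer the same label.
            out.extend((a, (q2, M2)) for b, M2 in mem if b == a)
    # MEMORY-INTERNAL: only the memory moves.
    out.extend((b, (q, M2)) for b, M2 in mem if b == PER)
    return out


def toy_prog(q):
    """Hypothetical program: may step silently, write 1, read 7, or crash."""
    return [((1, EPS), q + 1), ((1, ("W", "x", 1)), q + 1),
            ((1, ("R", "x", 7)), q + 1), (CRASH, 0)]


def toy_mem(M):
    """Hypothetical one-cell memory holding the current value of x."""
    return [((1, ("W", "x", 1)), 1), ((1, ("R", "x", M)), M),
            (PER, M), (CRASH, 0)]
```

In the toy example the program offers a read of value 7, but the memory only offers a read of the current value, so that read does not appear among the product's transitions. This is how the memory system filters the arbitrary read values allowed by the program semantics.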


4.2 Extending PSC for Library Abstraction 


We present the modifications of PSC for supporting the new specification con- 
structs: localized sfences and persistence blocks. When referring to PSC in the 
sequel we mean the following revised version. 


Local store fences. Localized sfences are straightforwardly supported by the 
following additional memory transition: 


LOCAL SFENCE
\[ \dfrac{l = \mathsf{LSF}(X) \quad \forall \dot{y} \in X.\ \mathsf{FO}(\tau) \notin M.P(\dot{y})}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M} \]

Here, instead of blocking until all $\mathsf{FO}(\tau)$ entries are removed from all buffers, 
we only require that such entries are not present in buffers associated with 
variables from a certain set (mentioned in the action label and corresponding to 
the argument of the lsfence(X) instruction). 


Persistence blocks. We assume an infinite set $\mathsf{BlockID}$ of block identifiers 
that are non-deterministically allocated when blocks are opened. The state of 
the memory system keeps track of a mapping assigning the current open block 
identifier to every thread and non-volatile variable, or $\bot$ if the variable is not a 
part of an open block of the thread. When writing to non-volatile variables, the 
associated block identifiers are attached to the write entry in the per-location 
persistence buffer. In turn, the propagation from the buffers to the NVM ensures 
that blocks are propagated only after they are closed and only in their entirety. 
To do so, we generalize the persist step of PSC to allow simultaneous propagation 
of multiple entries from the buffers. To respect the per-variable FIFO order, the 
propagated entries should form a prefix of each buffer. 

Formally, this requires the following modifications: 

1. Write entries in buffers take the form $j{:}\mathsf{W}(v)$ where $j \in \mathsf{BlockID} \cup \{\bot\}$ and 
$v \in \mathsf{Val}$ (instead of $\mathsf{W}(v)$). A write entry of the form $\bot{:}\mathsf{W}(v)$ means that the 
corresponding write was not a part of a persistence block. 

2. States are extended to be quintuples $M = \langle \dot{m}, \tilde{m}, P, B, Bid \rangle$, where: 

  – $B : \mathsf{Tid} \to \mathsf{NVVar} \to (\mathsf{BlockID} \cup \{\bot\})$ is called the active-block 
    mapping. It assigns a block identifier (or $\bot$ if there is no active block) to 
    every thread identifier and non-volatile variable. 

  – $Bid \subseteq \mathsf{BlockID} \times \mathcal{P}(\mathsf{NVVar})$ is called the block identifiers set. It is 
    used to store all persistence block identifiers occurring so far, each 
    accompanied by the set of non-volatile variables that it protects. 

  We denote by $M.B$ and $M.Bid$ the additional components of a state $M$. We 
  impose the following well-formedness conditions: 

  – If $j{:}\mathsf{W}(\_) \in M.P(\dot{x})$, then $\langle j, \{\dot{x}\} \cup X \rangle \in M.Bid$ for some $X \subseteq \mathsf{NVVar}$. 

  – If $M.B(\tau)(\dot{x}) \neq \bot$, then $\langle M.B(\tau)(\dot{x}), \{\dot{x}\} \cup X \rangle \in M.Bid$ for some 
    $X \subseteq \mathsf{NVVar}$. 

3. The initial state is given by $M_{\mathsf{init}} = \langle \dot{m}_{\mathsf{init}}, \tilde{m}_{\mathsf{init}}, P_{\mathsf{init}}, B_{\mathsf{init}}, Bid_{\mathsf{init}} \rangle$, 
where $B_{\mathsf{init}} = \lambda \tau.\, \lambda \dot{x}.\, \bot$, and $Bid_{\mathsf{init}} = \emptyset$. 

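4. The NV-WRITE transition records the current active block in the added entry: 

NV-WRITE
\[ \dfrac{l = \mathsf{W}(\dot{x}, v) \quad p' = M.P(\dot{x}) \cdot M.B(\tau)(\dot{x}){:}\mathsf{W}(v) \quad P' = M.P[\dot{x} \mapsto p']}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M[P \mapsto P']} \]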


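5. The following two transitions for opening and closing blocks are added: 

BEGINPB
\[ \dfrac{\begin{array}{c} l = \mathsf{beginPB}(X) \quad \forall \dot{x} \in X.\ M.B(\tau)(\dot{x}) = \bot \\ B' = M.B[\tau \mapsto \lambda \dot{x}.\ \text{if } \dot{x} \in X \text{ then } j \text{ else } M.B(\tau)(\dot{x})] \\ \langle j, \_ \rangle \notin M.Bid \quad Bid' = M.Bid \cup \{\langle j, X \rangle\} \end{array}}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M[B \mapsto B', Bid \mapsto Bid']} \]

ENDPB
\[ \dfrac{l = \mathsf{endPB}(X) \quad B' = M.B[\tau \mapsto \lambda \dot{x}.\ \text{if } \dot{x} \in X \text{ then } \bot \text{ else } M.B(\tau)(\dot{x})]}{M \xrightarrow{\tau, l}_{\mathsf{PSC}} M[B \mapsto B']} \]

Thus, opening a block allocates a fresh identifier and sets the active-block 
mapping accordingly. In turn, closing a block resets the relevant variables in 
the active-block mapping. 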

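6. The following transition is used instead of PERSIST-WRITE and PERSIST-FO. It 
generalizes both PERSIST-WRITE and PERSIST-FO by simultaneously persisting 
several entries together (each $p_{\dot{x}}$ below stands for a sequence of entries). 

PERSIST
\[ \dfrac{\begin{array}{c} l = \mathsf{per} \quad \forall \dot{x}.\ M.P(\dot{x}) = p_{\dot{x}} \cdot P'(\dot{x}) \\ \forall j.\ (\exists \dot{x}.\ j{:}\mathsf{W}(\_) \in p_{\dot{x}}) \implies \forall \dot{y}.\ (\forall \tau.\ M.B(\tau)(\dot{y}) \neq j \land j{:}\mathsf{W}(\_) \notin P'(\dot{y})) \\ \dot{m}' = \lambda \dot{x}.\ \begin{cases} v & \text{if the last write entry in } p_{\dot{x}} \text{ has value } v \\ M.\dot{m}(\dot{x}) & \text{if there are no write entries in } p_{\dot{x}} \end{cases} \end{array}}{M \xrightarrow{l}_{\mathsf{PSC}} M[\dot{m} \mapsto \dot{m}', P \mapsto P']} \]

This step imposes two restrictions. First, the persisted entries from each buffer 
($p_{\dot{x}}$) should form a prefix of that buffer, so that FIFO semantics is maintained. 
Second, to respect the persistence blocks, if some entry of a given block is 
persisted ($\exists \dot{x}.\ j{:}\mathsf{W}(\_) \in p_{\dot{x}}$) then that block should not be currently active in 
any thread ($\forall \dot{y}, \tau.\ M.B(\tau)(\dot{y}) \neq j$) and no entries of that block should remain 
in the volatile buffers ($\forall \dot{y}.\ j{:}\mathsf{W}(\_) \notin P'(\dot{y})$). 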


We note that nested and interleaved blocks are allowed. The program below 
demonstrates such a case. Here, $\dot{x} = 1$ and $\dot{y} = 1$ must persist together; 
$\dot{z} = 1$ and $\dot{w} = 1$ must persist together; but these two pairs can persist 
independently of each other in any order. Thus, provided that the client and the 
library use blocks of their own locations, the block instructions by each component 
are invisible to the other. 

    beginPB(ẋ, ẏ); 
    ẋ := 1; 
    beginPB(ż, ẇ); 
    ż := 1; ẇ := 1; 
    endPB(ż, ẇ); 
    ẏ := 1; 
    endPB(ẋ, ẏ); 
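
The two restrictions of the generalized persist step can be checked mechanically. The following Python sketch does so under assumptions of its own: buffer entries are encoded as `("W", block_id, value)` or `("FO", tid)`, the active-block mapping is a dict keyed by `(tid, var)` pairs, `cut` gives the length of the prefix to persist per variable, and only proper (non-`None`) block identifiers are subject to the block condition.

```python
BOT = None  # block id of writes issued outside any persistence block


def persist(nvm, buffers, active, cut):
    """Try the generalized persist step; return (nvm', buffers') or None.

    nvm     : var -> value (non-volatile memory)
    buffers : var -> list of ("W", block_id, value) or ("FO", tid) entries
    active  : (tid, var) -> block id currently open for that thread/variable
    cut     : var -> number of leading buffer entries to persist
    """
    # Restriction 1: persisted entries form a prefix of each buffer.
    prefix = {x: buffers[x][:cut.get(x, 0)] for x in buffers}
    rest = {x: buffers[x][cut.get(x, 0):] for x in buffers}
    blocks = {e[1] for p in prefix.values() for e in p
              if e[0] == "W" and e[1] is not BOT}
    # Restriction 2: a persisted block is closed and persists in its entirety.
    for j in blocks:
        if j in active.values():
            return None
        if any(e[0] == "W" and e[1] == j for r in rest.values() for e in r):
            return None
    new_nvm = dict(nvm)
    for x, p in prefix.items():
        writes = [e[2] for e in p if e[0] == "W"]
        if writes:
            new_nvm[x] = writes[-1]   # last persisted write wins
    return new_nvm, rest


# State after the nested-block program has run, with hypothetical block
# ids 1 (for x, y) and 2 (for z, w), both already closed.
nvm0 = {"x": 0, "y": 0, "z": 0, "w": 0}
bufs0 = {"x": [("W", 1, 1)], "y": [("W", 1, 1)],
         "z": [("W", 2, 1)], "w": [("W", 2, 1)]}
```

On this state, persisting only `x` is rejected because block 1 would be split, whereas persisting `{x, y}` or `{z, w}` on their own is allowed, matching the claim that the two pairs persist independently.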


4.3 Separation Properties 


To enable our library abstraction proof, the required key property of PSC, which 
we preserved in its extensions, is the ability to separate PSC states into disjoint 
parts (the library’s part and the client’s part) and capture each memory tran- 
sition in terms of its effect on the two parts. Next, we formulate this property, 
which we will later use to prove library abstraction. In fact, our arguments for 
library abstraction rely only on the properties below, and never “unfold” the 
PSC-related definitions. This allows one to refine and extend PSC, as long as the 
separation properties are preserved. 

The separation of PSC states is stated in terms of the following restriction 
operator relative to a set of variables. For persistence blocks to behave correctly, 
we need an auxiliary condition on this set: we say that a set $X \subseteq \mathsf{NVVar}$ separates 
a state $M \in \mathsf{PSC}.Q$ if for every $\langle j, Y \rangle \in M.Bid$, we have $Y \subseteq X$ or 
$Y \subseteq \mathsf{NVVar} \setminus X$. 


Definition 7. The restriction of $M \in \mathsf{PSC}.Q$ onto a set $X \subseteq \mathsf{Var}$ such that 
$X \cap \mathsf{NVVar}$ separates $M$, denoted by $M|_X$, is the state $M' \in \mathsf{PSC}.Q$ given by: 

• $M'.\dot{m}(\dot{x})$ is $M.\dot{m}(\dot{x})$ if $\dot{x} \in \mathsf{NVVar} \cap X$, or $0$ otherwise. 

• $M'.\tilde{m}(\tilde{x})$ is $M.\tilde{m}(\tilde{x})$ if $\tilde{x} \in \mathsf{VVar} \cap X$, or $0$ otherwise. 

• $M'.P(\dot{x})$ is $M.P(\dot{x})$ if $\dot{x} \in \mathsf{NVVar} \cap X$, or $\epsilon$ otherwise. 

• For each $\tau \in \mathsf{Tid}$, $M'.B(\tau)(\dot{x})$ is $M.B(\tau)(\dot{x})$ if $\dot{x} \in \mathsf{NVVar} \cap X$, or $\bot$ otherwise. 

• $M'.Bid = \{\langle j, Y \rangle \in M.Bid \mid Y \subseteq X\}$. 


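The next lemma states the separation property of PSC, providing a precise 
characterization of each PSC transition in terms of transitions on the restrictions 
$M|_X$ and $M|_{\mathsf{Var} \setminus X}$. A special case is needed for store fence transitions, since 
taking these transitions enforces conditions on both restrictions. 


Lemma 1. Let $X \subseteq \mathsf{Var}$ such that $X \cap \mathsf{NVVar}$ separates a state $M_1$. 

1. For every $\tau \in \mathsf{Tid}$ and $l \in \mathsf{Lab} \setminus \{\mathsf{SF}\}$ with $\mathsf{varset}(l) \subseteq X$, 
\[ M_1 \xrightarrow{\tau, l}_{\mathsf{PSC}} M_2 \iff \left( M_1|_X \xrightarrow{\tau, l}_{\mathsf{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} = M_2|_{\mathsf{Var} \setminus X} \right) \]

2. For every $\tau \in \mathsf{Tid}$, 
\[ M_1 \xrightarrow{\tau, \mathsf{SF}}_{\mathsf{PSC}} M_2 \iff \left( M_1|_X \xrightarrow{\tau, \mathsf{SF}}_{\mathsf{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} \xrightarrow{\tau, \mathsf{SF}}_{\mathsf{PSC}} M_2|_{\mathsf{Var} \setminus X} \right) \]

3. \[ M_1 \xrightarrow{\mathsf{per}}_{\mathsf{PSC}} M_2 \iff \left( M_1|_X \xrightarrow{\mathsf{per}}_{\mathsf{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} \xrightarrow{\mathsf{per}}_{\mathsf{PSC}} M_2|_{\mathsf{Var} \setminus X} \right) \]

4. \[ M_1 \xrightarrow{\lightning}_{\mathsf{PSC}} M_2 \iff \left( M_1|_X \xrightarrow{\lightning}_{\mathsf{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} \xrightarrow{\lightning}_{\mathsf{PSC}} M_2|_{\mathsf{Var} \setminus X} \right) \]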


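The proof of Lemma 1 proceeds by standard case analysis ranging over all 
possible transitions of PSC. Finally, the following operation is used below to 
compose a state from client and library components (see Lemma 2). 


Definition 8. Let $M_1, M_2$ be states of PSC, and $X_1, X_2 \subseteq \mathsf{Var}$ such that 
$X_1 \cap X_2 = \emptyset$. The merge of $M_1$ and $M_2$ w.r.t. $X_1$ and $X_2$, denoted by 
$(M_1, X_1) \uplus (M_2, X_2)$, is the state $M \in \mathsf{PSC}.Q$ defined by: 

\[ M.\dot{m}(\dot{x}) = \begin{cases} M_1.\dot{m}(\dot{x}) & \dot{x} \in X_1 \\ M_2.\dot{m}(\dot{x}) & \dot{x} \in X_2 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(similar definitions for } M.\tilde{m},\ M.P,\ M.B\text{)} \]

\[ M.Bid = \{\langle j, Y \rangle \in M_1.Bid \mid Y \subseteq X_1\} \cup \{\langle j, Y \rangle \in M_2.Bid \mid Y \subseteq X_2\} \]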
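
Restriction (Def. 7) and merge (Def. 8) can be sketched over a dictionary encoding of PSC states. The encoding is an assumption of this sketch: a state is a dict with the hypothetical keys `nvm`, `vm`, `buf`, `B` (keyed by `(tid, var)`) and `bid` (a set of `(block_id, frozenset_of_vars)` pairs).

```python
BOT = None


def restrict(M, X):
    """M|_X: keep the components of variables in X, reset all others."""
    return {
        "nvm": {x: (v if x in X else 0) for x, v in M["nvm"].items()},
        "vm": {x: (v if x in X else 0) for x, v in M["vm"].items()},
        "buf": {x: (list(p) if x in X else []) for x, p in M["buf"].items()},
        "B": {(t, x): (j if x in X else BOT) for (t, x), j in M["B"].items()},
        "bid": {(j, Y) for (j, Y) in M["bid"] if Y <= X},
    }


def separates(M, X):
    """X separates M if no recorded block straddles X and its complement."""
    nv = set(M["nvm"])
    return all(Y <= X or Y <= nv - X for (_, Y) in M["bid"])


def merge(M1, X1, M2, X2):
    """(M1, X1) merged with (M2, X2); X1 and X2 must be disjoint."""
    assert not (X1 & X2)

    def pick(field, default, var=lambda k: k):
        # Take each entry from M1 or M2 depending on which set owns its variable.
        return {
            k: M1[field][k] if var(k) in X1
            else M2[field][k] if var(k) in X2
            else default
            for k in M1[field]
        }

    return {
        "nvm": pick("nvm", 0),
        "vm": pick("vm", 0),
        "buf": pick("buf", []),
        "B": pick("B", BOT, var=lambda k: k[1]),
        "bid": {(j, Y) for (j, Y) in M1["bid"] if Y <= X1}
        | {(j, Y) for (j, Y) in M2["bid"] if Y <= X2},
    }
```

When `X` separates `M`, restricting `M` to `X` and to its complement and then merging the two parts gives back `M`, which is the sense in which a state splits into a client part and a library part.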


5 Libraries and Their Clients 


We present the notions of libraries and clients, as well as the necessary definitions 
for stating the abstraction theorem: histories and most general clients. 


Libraries. We take a library $L$ to be a function assigning to method names in 
$\mathsf{dom}(L) \subseteq \mathsf{F}$ flat instruction sequences representing the method bodies. In the 
context of some library $L$, we refer to the implementations of the methods in 
$(\{\mathsf{main}\} \cup \mathsf{F}) \setminus \mathsf{dom}(L)$ in a program $Pr$ as the client of $L$. 


Client-library composition. We consider the common case where libraries 
and their clients never access the same shared variables. To formally define this 
restriction, we use the following notations for sets of locations used by instruction 
sequences, libraries, and their clients: 

• $\mathsf{Var}(I)$ denotes the set of shared variables mentioned in an instruction sequence 
$I$ (possibly as a part of a set $X$ of variables, e.g., in beginPB(X)). 

• For a library $L$, $\mathsf{Var}(L) = \bigcup_{f \in \mathsf{dom}(L)} \mathsf{Var}(L(f))$. 

• For a program $Pr$ and a set $F \subseteq \mathsf{F}$, 

\[ \mathsf{Var}(Pr \setminus F) = \bigcup_{\tau \in \mathsf{Tid}} \mathsf{Var}(Pr(\tau)(\mathsf{main})) \;\cup \bigcup_{f \in \mathsf{F} \setminus F} \mathsf{Var}(Pr(f)) \]

Then, client-library composition is defined as follows. 


Definition 9. A library $L$ is safe for a program $Pr$ if 
$\mathsf{Var}(L) \cap \mathsf{Var}(Pr \setminus \mathsf{dom}(L)) = \emptyset$. When $L$ is safe for $Pr$, we write 
$Pr[L]$ for the program obtained from $Pr$ by setting $Pr(\tau)(f) = L(f)$ for every 
$\tau \in \mathsf{Tid}$ and $f \in \mathsf{dom}(L)$. 


Note that we always have Var(Pr[L] \ dom(L)) = Var(Pr \ dom(L)). 


Histories. Histories record the interactions between libraries and clients. 
Formally, a history $h$ of a library $L$ is a sequence of transition labels representing 
a crash, a call to a method of $L$, a return from a method of $L$, or an sfence, i.e., 
labels from the set $\mathsf{HTLab}_{\mathsf{dom}(L)}$, which is defined as follows: 

\[ \mathsf{Lab}_F = \{\mathsf{SF}\} \cup \{\mathsf{CALL}(f, \phi), \mathsf{RET}(f, \phi) \mid f \in F, \phi : \mathsf{Reg} \to \mathsf{Val}\} \]
\[ \mathsf{HTLab}_F = (\mathsf{Tid} \times \mathsf{Lab}_F) \cup \{\lightning\} \]

Definition 10. Let $t$ be a trace of $Pr \bowtie \mathsf{PSC}$ for some program $Pr$. The history 
induced by $t$ w.r.t. a set $F \subseteq \mathsf{F}$, denoted by $H_F(t)$, is the subsequence of $t$ over 
$\mathsf{HTLab}_F$ consisting of (in the same order they appear in $t$): call and return labels 
$\langle \tau, \mathsf{CALL}(f, \phi) \rangle$ and $\langle \tau, \mathsf{RET}(f, \phi) \rangle$ with $f \in F$; SF-labels 
$\langle \tau, \mathsf{SF} \rangle$; and crash labels. The notation $H_F(t)$ is extended to sets of traces 
in the obvious way. The set of histories w.r.t. $F$ of $Pr$, denoted by $H_F(Pr)$, is given 
by $H_F(\mathsf{traces}(Pr \bowtie \mathsf{PSC}))$. When $F = \mathsf{F}$ (i.e., the set of all method names), 
we simply write $H(t)$ and $H(Pr)$. 
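
The projection of a trace onto its history can be sketched in Python as a simple filter. The label encoding is an assumption of this sketch: thread-labeled steps are `(tid, l)` pairs with `l` a tuple such as `("CALL", f, store)`, silent steps are `(tid, "eps")`, and the strings `"per"` and `"crash"` stand for the persist and crash labels.

```python
CRASH = "crash"


def history(trace, F):
    """Subsequence of `trace` kept by the history projection w.r.t. F.

    Keeps crash labels, sfence labels of any thread, and call/return
    labels whose method name belongs to F; drops everything else.
    """
    keep = []
    for a in trace:
        if a == CRASH:
            keep.append(a)
        elif isinstance(a, tuple):
            _, l = a
            if isinstance(l, tuple) and (
                l[0] == "SF" or (l[0] in ("CALL", "RET") and l[1] in F)
            ):
                keep.append(a)
    return keep


# Hypothetical trace; stores are abbreviated to the empty tuple ().
example = [
    (1, ("CALL", "push", ())),
    (1, ("W", "x", 1)),
    "per",
    (1, ("SF",)),
    (2, "eps"),
    (1, ("RET", "push", ())),
    "crash",
    (2, ("CALL", "log", ())),
]
```

With `F = {"push"}`, the write, the persist step, the silent step and the call to the unrelated method `log` are dropped, so only what the client and the library can observe of each other remains.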


Most general clients. We encompass library calling policies (see §2.3) using 
the notion of a “most general client”—a non-deterministic client that invokes 
the library methods in the most general way allowed by the policy. Formally, a 
most general client MGC is given as a (concurrent) program. Adherence to the 
calling policy is defined as follows. 


Definition 11. Let $L$ be a library, and $Pr$ and $\mathsf{MGC}$ be programs such that $L$ 
is safe for both $Pr$ and $\mathsf{MGC}$. We say that $Pr$ correctly calls $L$ w.r.t. $\mathsf{MGC}$ if 
$H_{\mathsf{dom}(L)}(Pr[L]) \subseteq H_{\mathsf{dom}(L)}(\mathsf{MGC}[L])$. 


The policy of a library with no restrictions on its clients (beyond the separation 
of shared resources) is expressed by an MGC, called $\mathsf{MGC}_{\mathsf{free}}$, that repeatedly 
invokes arbitrary library methods with arbitrary initial stores. Often persistent 
objects include a recovery method meant to be executed after a crash before 
any other method is invoked. We call such a policy $\mathsf{MGC}_{\mathsf{rec}}$. Formally, $\mathsf{MGC}_{\mathsf{free}}$ 
(for $\mathsf{dom}(L) = \{f_1, \ldots, f_n\}$) and $\mathsf{MGC}_{\mathsf{rec}}$ (for $\mathsf{dom}(L) = \{f_1, \ldots, f_n\} \uplus \{\mathsf{recover}\}$) 
assign the following main method to each thread $\tau$: 


MGC_free(τ)(main) = 

    BEGIN: havoc; 
           goto f_1 | ... | f_n | END; 
    f_1:   call(f_1); goto BEGIN; 
    ... 
    f_n:   call(f_n); goto BEGIN; 
    END: 

MGC_rec(τ)(main) = 

           a := CAS(x, 0, 1); if a = 0 goto REC; goto WAIT; 
    REC:   call(recover); f := 1; goto BEGIN; 
    WAIT:  a := f; if a = 0 goto WAIT; goto BEGIN; 
    BEGIN: ... rest of the code as in MGC_free ... 


In $\mathsf{MGC}_{\mathsf{rec}}$, using a compare-and-swap, one thread performs the recovery. All 
other threads wait until recovery ends to start their method invocations. 


6 The Library Abstraction Theorem 


In this section we state and prove the library abstraction theorem. The premise 
of this theorem, the library correctness condition, is formulated as follows. 


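Definition 12. Let $L$ and $L^\sharp$ be libraries, both safe for a program $\mathsf{MGC}$. We 
say that $L$ refines $L^\sharp$ w.r.t. $\mathsf{MGC}$, denoted by $L \sqsubseteq_{\mathsf{MGC}} L^\sharp$, if both libraries 
implement the same methods and $H(\mathsf{MGC}[L]) \subseteq H(\mathsf{MGC}[L^\sharp])$. 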


Next, the abstraction theorem states that $L \sqsubseteq_{\mathsf{MGC}} L^\sharp$ ensures that any client 
adhering to the library's calling policy may safely use the implementation $L$ while 
reasoning about possible behaviors in terms of the specification $L^\sharp$. Our notion 
of "a behavior" includes the generated histories, as well as the reachable states, 
by the composition of the program and the memory system. Including reachable 
states is intended to assist safety verification. Clearly, we cannot require that the 
program states match for threads that are currently executing a method of $L$. In 
addition, since $L$ and $L^\sharp$ may update the memory differently (e.g., use different 
variables), we should only consider the variables of the client when inspecting 
the memory states. This leads us to the following statement. 


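Theorem 1 (Abstraction). Suppose that $L \sqsubseteq_{\mathsf{MGC}} L^\sharp$. Let $\mathsf{MGC}$ and $Pr$ be 
programs such that both $L$ and $L^\sharp$ are safe for $\mathsf{MGC}$ and $Pr$, and $Pr$ correctly 
calls $L^\sharp$ w.r.t. $\mathsf{MGC}$. If $\langle \bar{q}_{\mathsf{init}}, M_{\mathsf{init}} \rangle \xrightarrow{t}_{Pr[L] \bowtie \mathsf{PSC}} \langle \bar{q}, M \rangle$, then there 
exist $t^\sharp$ and $\langle \bar{q}^\sharp, M^\sharp \rangle$ such that the following hold: 

• $\langle \bar{q}_{\mathsf{init}}, M_{\mathsf{init}} \rangle \xrightarrow{t^\sharp}_{Pr[L^\sharp] \bowtie \mathsf{PSC}} \langle \bar{q}^\sharp, M^\sharp \rangle$. 

• $H(t^\sharp) = H(t)$. 

• For every $\tau \in \mathsf{Tid}$, if $\bar{q}(\tau).f \notin \mathsf{dom}(L)$, then $\bar{q}^\sharp(\tau) = \bar{q}(\tau)$. 

• $M^\sharp|_{\mathsf{Var}(Pr \setminus \mathsf{dom}(L))} = M|_{\mathsf{Var}(Pr \setminus \mathsf{dom}(L))}$ (see Def. 7). 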


Note that $L \sqsubseteq_{\mathsf{MGC}} L^\sharp$ is necessary for the conclusion to hold: otherwise, 
$\mathsf{MGC}$ itself is a client that can observe behaviors of $L$ that are impossible for 
$L^\sharp$. Following §2.3, we also note that policy adherence is required w.r.t. $L^\sharp$. 

To prove the abstraction theorem, the following key lemma is used multiple 
times (with different arguments). It allows us to compose the client’s part from 
one trace with the library’s part from another into one combined trace. 


Lemma 2 (Composition). Let $L$ and $L'$ be libraries implementing the 
same set $F$ of methods such that both are safe for a program $Pr$, and $L$ is 
also safe for a program $Pr'$. Suppose that 
$\langle \bar{q}_{\mathsf{init}}, M_{\mathsf{init}} \rangle \xrightarrow{t_{cl}}_{Pr[L'] \bowtie \mathsf{PSC}} \langle \bar{q}_{cl}, M_{cl} \rangle$, 
$\langle \bar{q}_{\mathsf{init}}, M_{\mathsf{init}} \rangle \xrightarrow{t_{lib}}_{Pr'[L] \bowtie \mathsf{PSC}} \langle \bar{q}_{lib}, M_{lib} \rangle$, 
and $H_F(t_{cl}) = H_F(t_{lib})$. Then, there exists a trace $t$ such that $H(t) = H(t_{cl})$ and 
$\langle \bar{q}_{\mathsf{init}}, M_{\mathsf{init}} \rangle \xrightarrow{t}_{Pr[L] \bowtie \mathsf{PSC}} \langle \bar{q}, M \rangle$, for: 

• \[ \bar{q} = \lambda \tau.\ \begin{cases} \langle \bar{q}_{lib}(\tau).pc,\ \bar{q}_{lib}(\tau).\phi,\ \bar{q}_{cl}(\tau).pc_s,\ \bar{q}_{cl}(\tau).f \rangle & \bar{q}_{cl}(\tau).f \in F \\ \bar{q}_{cl}(\tau) & \text{otherwise} \end{cases} \]

• $M = (M_{cl}|_{\mathsf{Var}(Pr \setminus F)}, \mathsf{Var}(Pr \setminus F)) \uplus (M_{lib}|_{\mathsf{Var}(L)}, \mathsf{Var}(L))$ (see Def. 8). 

The proof of Lemma 2 is based on the inherent disjointness in client-library 
composition provided by a library safe for its client program, which we leverage 
in the following two ways. 

Firstly, we extract client-local and library-local transition properties from all 
transitions of $Pr[L'] \bowtie \mathsf{PSC}$ and $Pr'[L] \bowtie \mathsf{PSC}$. Thus, when we consider a transition 
by $Pr[L'] \bowtie \mathsf{PSC}$ corresponding to an instruction outside of a method of $L'$, we 
show that an analogous transition is possible with the same program state, 
but with memory state zeroing out locations used by the library $L'$. Similarly, 
when we consider a transition by $Pr'[L] \bowtie \mathsf{PSC}$ corresponding to an instruction 
in a method of $L$, we show that an analogous transition is possible with almost 
the same program state, except we alter its stored program counter, and with 
memory state zeroing out locations used by the client $Pr'$. The justifications for 
these steps follow by the ($\Rightarrow$) directions of Lemma 1. 

Secondly, we compose the client-local transition properties $Pr$ exhibits in $t_{cl}$ 
and the library-local transition properties $L$ exhibits in $t_{lib}$ while constructing 
transitions of $Pr[L] \bowtie \mathsf{PSC}$ for a trace $t$. Knowing that $L$ is safe for $Pr$, we 
consider client-local transition properties from $t_{cl}$ corresponding to transitions we 
wish to recreate in $t$, and replace zeroed-out memory locations with locations of 
$L$. Dually, we consider library-local transition properties from $t_{lib}$ corresponding 
to transitions we wish to recreate in $t$, and replace zeroed-out memory locations 
with locations of $Pr$. The ($\Leftarrow$) directions of Lemma 1 justify such transformations. 
For instance, non-SF-transitions can be composed, provided that the client 
program preserves the library memory state, and vice versa; while crashes and 
SF-transitions record an interaction between a client program and a library and 
therefore need to be performed in synchrony. 

We use these two ideas in proving Lemma 2 by induction on the sum of 
lengths of $t_{cl}$ and $t_{lib}$, and use their local transition properties to justify composing 
them in synchrony. For the base case, we can simply take $t = \epsilon$. For the induction 
step, we consider the last labels in $t_{cl}$ and $t_{lib}$, as well as the cases when one of the 
traces is empty. When $t_{cl} = \_ \cdot \alpha_{cl}$ and $t_{lib} = \_ \cdot \alpha_{lib}$, we use $t'$ from the induction 
hypothesis for $t_{cl}$ and $t_{lib}$ with the last action removed from either or both of 
them, and let $t = t' \cdot \alpha_{cl}$ or $t = t' \cdot \alpha_{lib}$. 

Then, the abstraction theorem is proved as follows. 


Proof outline for Thm. 1. It suffices to show $H(Pr[L]) \subseteq H(Pr[L^\sharp])$; then the 
claim follows using Lemma 2 by letting $L := L^\sharp$, $L' := L$, $Pr := Pr$, and $Pr' := Pr$. 
Suppose otherwise, and let $h$ be a shortest history in $H(Pr[L]) \setminus H(Pr[L^\sharp])$. 
Let $t$ be a shortest trace in $\mathsf{traces}(Pr[L] \bowtie \mathsf{PSC})$ with $H(t) = h$. Consider the last 
transition label $\alpha$ in $t$. The minimality of $h$ and $t$ ensures that $\alpha$ must be a return 
transition label for some $f \in \mathsf{dom}(L)$. Indeed, otherwise, we can show that $\alpha$ is 
enabled in the end of a corresponding trace of $Pr[L^\sharp] \bowtie \mathsf{PSC}$, which contradicts 
the fact that $h \notin H(Pr[L^\sharp])$. (The full argument here requires applying Lemma 2 
with $L := L^\sharp$, $L' := L$, $Pr := Pr$, and $Pr' := Pr$.) 

Now, using the fact that $Pr$ correctly calls $L^\sharp$ w.r.t. $\mathsf{MGC}$, we again apply 
Lemma 2 with $L := L$, $L' := L^\sharp$, $Pr := \mathsf{MGC}$, and $Pr' := Pr$, and derive 
that $\alpha$ is enabled in the end of a corresponding trace of $\mathsf{MGC}[L] \bowtie \mathsf{PSC}$. Then, 
$L \sqsubseteq_{\mathsf{MGC}} L^\sharp$ ensures that $H_{\mathsf{dom}(L)}(t) \in H_{\mathsf{dom}(L)}(\mathsf{MGC}[L^\sharp])$. Using Lemma 2 for 
the last time (applied with $L := L^\sharp$, $L' := L$, $Pr := Pr$, and $Pr' := \mathsf{MGC}$), we 
obtain that $h = H(t) \in H(Pr[L^\sharp])$, which contradicts our assumption. 

The following corollary of Thm. 1 states that, like classical linearizability, 
our correctness condition is compositional (a.k.a. local), meaning that a library 
consisting of several (non-interacting) libraries can be abstracted by considering 
each sub-library separately. Formally, the composition of libraries L1, ..., Ln with
pairwise disjoint sets of declared methods, denoted by L1 ⊎ ... ⊎ Ln, is defined to
be the library obtained by taking the union of L1, ..., Ln. Compositionality is
formulated as follows. 


Corollary 1 (Compositionality). The following two conditions together
imply that L1 ⊎ ... ⊎ Ln ⊑_MGC L1♯ ⊎ ... ⊎ Ln♯:
1. Var(L1), ..., Var(Ln), Var(L1♯), ..., Var(Ln♯), Var(MGC \ dom(L1 ⊎ ... ⊎ Ln)) are
pairwise disjoint.
2. For all i, Li Emec, Li for MGC; = MGC[Liw...wL?_, W LP, Ww... WL). 


To end this section, we provide a simple lemma that allows one to establish
L ⊑_MGC L♯ by applying standard simulation arguments for crashless traces
(with observable transitions being those that induce history labels). For that
matter, we require a simulation relation on non-volatile memories generated by
MGC[L] ⋈ PSC and MGC[L♯] ⋈ PSC that holds for the very initial memory and
is preserved during crashless executions.


Lemma 3. A trace t is rno-to-r if (Gini, Minit {tt rnol) + Prapsc (q, Mn 4 m]) 

for some q and M. Suppose that some relation R on NVVar → Val satisfies:

• (m_init, m_init) ∈ R.

• If (m0, m0♯) ∈ R, then for every m0-to-m crashless trace t of MGC[L] ⋈ PSC,
there exist a non-volatile memory m♯ and an m0♯-to-m♯ crashless trace t♯ of
MGC[L♯] ⋈ PSC, such that (m, m♯) ∈ R and H(t) = H(t♯).

Then, assuming dom(L) = dom(L♯), we have that L ⊑_MGC L♯.


Furthermore, if MGC[L♯] has no fo(·) and sfence instructions, then MGC[L♯] ⋈ PSC
can take non-deterministic sfence steps (see §3) when MGC[L] ⋈ PSC
takes SF-steps, so store fences can be ignored when checking H(t) = H(t♯).


7 An Application: Persistent Pairs 


We illustrate the use of the library abstraction theorem for a simple concurrent 
and persistent data structure—a pair of values that supports write and read 
operations. We present two specifications and an implementation for each spec- 
ification. Both specifications ensure atomicity (i.e., linearizability if the system 
does not crash), and “data consistency” (reads return values written by a single 
write invocation), but they differ in their persistency guarantees. For the
concurrency aspect, the implementations follow the sequence lock (seqlock, for short)
mechanism, which uses a version counter along with the pair and allows
readers to avoid blocking [6]. For durability, the implementations employ different
techniques: one uses a “redo log” and the other is based on “checkpoints”. 


A durable pair. The first specification, a library we denote by L♯_pair, consists
of three methods: write for writing the two values of the pair, read for reading
the pair, and recover for recovering from a crash. The specification is as follows:⁵


write:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    beginPB(x1, x2);
    x1 := a1; x2 := a2;
    endPB(x1, x2);
    fl(x1);
    UNLOCK: l := 0;
    return;

read:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    a1 := x1; a2 := x2;
    UNLOCK: l := 0;
    return;

recover:
    return;


A volatile lock (l) is used to ensure atomicity. For durability, writes use
persistence blocks, which ensure that the two parts of the pair persist simultaneously.
After the block is ended, fl(x1) (equivalent here to fl(x2) due to the
persistence block) ensures that the block persists. If the system crashes after a write
completed, the written values are guaranteed to survive the crash. Thus, there is
nothing to be done at recovery. Nevertheless, aiming to allow implementations,
the library policy requires that recovery is executed after every crash before
other methods are invoked (MGC_rec in §5).

Next, we present an implementation of L♯_pair, which we denote by L_pair. We
write x := y instead of a read of y (to some fresh register) followed by a write
to x. We also omit some necessary register bookkeeping: since histories record
the whole register store in call/return labels, strictly speaking, implementations
must unroll changes to registers not used to pass return values.


write:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    x1^new := a1; fo(x1^new); x2^new := a2; fo(x2^new);
    sfence;
    s := s + 1; fl(s);
    x1 := a1; fo(x1); x2 := a2; fo(x2);
    sfence;
    s := s + 1;
    UNLOCK: l := 0;
    return;

read:
    BEGIN: a := s;
    if odd(a) goto BEGIN;
    a1 := x1; a2 := x2;
    if s ≠ a goto BEGIN;
    return;

recover:
    if even(s) goto END;
    x1 := x1^new; fo(x1);
    x2 := x2^new; fo(x2);
    sfence;
    END: s := 0;
    return;


Ignoring crashes, atomicity is guaranteed here using a seqlock. As for persistency, 
observe first that writing directly to the NVM is wrong since we cannot control 
the non-deterministic propagation: if a crash occurs during the execution of 
write, it is possible that only one part of the pair has persisted, and the recovery 
method will not have sufficient information for reinitializing the pair correctly. 
Instead, write first records its “job” in (x1^new, x2^new). Then, if a crash happens and
the write was in the middle of updating (x1, x2) (as identified via observing an
odd version number), the recovery will complete the job of the writer. We note
that the (rather extensive) use of flushes (or flush-optimals followed by an sfence)
is necessary here in order to restrict the out-of-order persistence. The final write
to s in write does not have to be explicitly persisted. Indeed, if a crash happens
between this write and its persistence, recovery will redo the (idempotent) job.


5 Our simplified language has no mechanism for argument passing. We assume that
write receives arguments (read returns results) via designated registers, a1 and a2.
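
The role of the log and the version counter can be illustrated with a small simulation. The sketch below is a toy model under loudly stated assumptions, not the PSC semantics used in this paper: non-volatile memory is a dictionary, stores separated by an sfence (or persisted with fl) are assumed to persist in program order, stores between two fences may persist in either order, and a crash may happen after any persisted store. All names (crash_states, recover, REDO_LOG, NAIVE, n1, n2) are hypothetical; n1 and n2 stand for the log variables.

```python
from itertools import permutations

OLD, NEW = (1, 2), (3, 4)  # pair before and after the write (arbitrary values)

# Persist stages of write in the redo-log implementation.
# Stores inside one stage are assumed to be unordered with respect to each other.
REDO_LOG = [
    [("n1", NEW[0]), ("n2", NEW[1])],  # record the job in the log
    [("s", 1)],                        # odd version: update in progress
    [("x1", NEW[0]), ("x2", NEW[1])],  # update the pair itself
    [("s", 2)],                        # even version: may or may not persist
]
# Writing straight to the pair, with no log and no version counter.
NAIVE = [[("x1", NEW[0]), ("x2", NEW[1])]]


def crash_states(stages):
    """Return every non-volatile memory that a crash during write may leave."""
    start = {"s": 0, "x1": OLD[0], "x2": OLD[1], "n1": None, "n2": None}
    found = []

    def run(i, mem):
        found.append(dict(mem))  # crash before stage i (or after the last stage)
        if i == len(stages):
            return
        for order in permutations(stages[i]):
            cur = dict(mem)
            for k, (var, val) in enumerate(order):
                cur[var] = val
                if k < len(order) - 1:
                    found.append(dict(cur))  # crash in the middle of the stage
            run(i + 1, cur)

    run(0, start)
    return found


def recover(mem):
    """Mirror of the recover method: redo the logged job if the version is odd."""
    mem = dict(mem)
    if mem["s"] % 2 == 1:
        mem["x1"], mem["x2"] = mem["n1"], mem["n2"]
    mem["s"] = 0
    return mem


def pairs_after_recovery(stages):
    """Set of pairs (x1, x2) observable after a crash followed by recovery."""
    return {(m["x1"], m["x2"]) for m in map(recover, crash_states(stages))}
```

In this model, pairs_after_recovery(REDO_LOG) contains only the old and the new pair, whereas pairs_after_recovery(NAIVE) also contains torn pairs such as (NEW[0], OLD[1]).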


Theorem 2. L_pair ⊑_{MGC_rec} L♯_pair.

Our proof sketch uses Lemma 3, letting (m, m♯) ∈ R if the following hold:

• If m(s) is even, then m(x1) = m♯(x1) and m(x2) = m♯(x2).

• If m(s) is odd, then m(x1^new) = m♯(x1) and m(x2^new) = m♯(x2).

Using the abstraction theorem, we obtain that for a program Pr that uses
L_pair correctly (i.e., calls recovery first after every crash), for every state ⟨q, M⟩
that is reachable in Pr[L_pair] ⋈ PSC, there exists a state ⟨q♯, M♯⟩ reachable in
Pr[L♯_pair] ⋈ PSC and indistinguishable from ⟨q, M⟩ from the client perspective.


A buffered durable pair. A second specification, denoted by L♯_bpair, allows for
“buffered” behaviors, which enable faster implementations by weakening
persistency guarantees [24]. Instead of requiring operations to persist before returning,
it only requires that operations are “persistently ordered” before returning.


write:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    beginPB(x1, x2);
    x1 := a1; x2 := a2;
    endPB(x1, x2);
    UNLOCK: l := 0;
    return;

read:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    a1 := x1; a2 := x2;
    UNLOCK: l := 0;
    return;

recover:
    return;

sync:
    fl(x1);
    return;

Compared to L♯_pair, the explicit flush instruction fl(x1) from the write method
is omitted, which means that a crash after a completed write may take the pair
back to its state before the write. Thus, the state after a crash need not
necessarily be fully up-to-date. An additional method, called sync, can be used to
ensure that previous writes have persisted. Without sync, an implementation could
simply ignore persistency and store the pair in the volatile memory, which
corresponds to an execution of L♯_bpair in which the persistency buffers are never
flushed.


An implementation can be obtained as follows:

write:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    s := s + 1;
    x1 := a1; x2 := a2;
    s := s + 1;
    UNLOCK: l := 0;
    return;

read:
    BEGIN: a := s;
    if odd(a) goto BEGIN;
    a1 := x1; a2 := x2;
    if s ≠ a goto BEGIN;
    return;

sync:
    LOCK: if CAS(l, 0, 1) goto LOCK;
    a1 := x1; a2 := x2;
    x1^prev := x1^next; fo(x1^prev);
    x2^prev := x2^next; fo(x2^prev);
    sfence;
    f := 1; fl(f);
    NEXT: x1^next := a1; fo(x1^next);
    x2^next := a2; fo(x2^next);
    sfence;
    f := 0; fl(f);
    UNLOCK: l := 0;
    return;

recover:
    if f = 1 goto PREV;
    x1 := x1^next; x2 := x2^next;
    return;
    PREV: x1 := x1^prev; x2 := x2^prev;
    f := 0; fl(f);
    return;

This implementation exploits the freedom allowed by the specification. Writes
and reads again employ a seqlock, but this time they only use volatile variables.
In turn, sync sets a “checkpoint”, and recovery rolls the state back to the
latest complete checkpoint. For that matter, a non-volatile flag f is used to
detect crashes during the setting of the checkpoint (x1^next, x2^next). Thus, before
storing the checkpoint, the previous checkpoint is stored in the non-volatile
variables (x1^prev, x2^prev). Upon recovery, given the value of the flag, we know if
we can restore the state from the current stored checkpoint, or, if a crash happened
during the store of this checkpoint (which means that sync did not return), set the
pair to the previous stored one.


Theorem 3. L_bpair ⊑_{MGC_rec} L♯_bpair.

Our proof sketch uses Lemma 3, letting (m, m♯) ∈ R if the following hold:

• If m(f) = 0, then m(x1^next) = m♯(x1) and m(x2^next) = m♯(x2).

• If m(f) = 1, then m(x1^prev) = m♯(x1) and m(x2^prev) = m♯(x2).


8 Related and Future Work 


Library abstraction theorems. Previous work has developed library abstrac- 
tion theorems for crashless shared memory concurrency. First, [13] formalized the 
intuition that standard linearizability as defined in [21] corresponds to contextual 
refinement (and also proved a completeness result: the converse also holds pro- 
vided that threads have other means of interaction besides the library). Later, [7] 
refined and formulated this result using history inclusion instead of linearizabil- 
ity, which is closer to our formalization. Other abstraction results account for 
liveness [16], resource-transferring programs [17], and x86-TSO [8]. Our compo- 
sition lemma (Lemma 2) is inspired by [8], which addresses a challenge that is 
close to the challenge posed by store fence instructions in NVM, where actions 
of the client and the library affect each other even if they access distinct
locations. To do so, the notion of a history is extended to expose events that
correspond to the flushing of certain entries from the x86-TSO store buffers, which
is close to what we do to handle store fences. Our alternative approach to this 
problem, i.e., introducing a relaxed version of the store fence, is novel. 

While our framework is operational, library abstraction was also studied 
before for declarative shared memory concurrency semantics, particularly in the 
context of the C11 weak memory model [5,28]. 


Linearizability notions for persistent objects. Different approaches for 
adapting the standard linearizability criterion that is based on crash-free se- 
quential specifications [21] were proposed before [3,19,24], but were not formally 
related to contextual refinement. Since methods like recover and sync (see §7) are 
meaningless in crash-free sequential specifications, they require an ad-hoc exter- 
nal treatment in these linearizability adaptations. The variety of approaches to 
interpret crash-free sequential specifications for crash-resilient concurrent objects
makes it hard, in particular, to combine libraries with different linearizability
guarantees in a single program.

In turn, these existing notions are typically expressible in the refinement 
framework that we employ. For example, in the crashless setting, by wrapping 
each method of a sequential implementation S of some object inside a global 
lock, one obtains an abstract library L♯_S for that object that corresponds to the
conditions imposed by standard linearizability [7] (a library L is linearizable
w.r.t. S iff every crashless history induced by a trace of MGC[L] is also induced
by some trace of MGC[L♯_S]). Now, when crashes are involved, by wrapping each
method of S inside a global lock and a persistence block followed by an explicit 
flush instruction (like L¥,;, in §7), one obtains an abstract library L%, that 
corresponds to the conditions imposed by strict linearizability of [3] (L is strictly 
linearizable w.r.t. S iff L Lyeo L5, ). Thus, our results can be used to derive 
contextual refinement (using LË ; as a specification) from strictly linearizable 
objects. We note that while the original definition of strict linearizability was for 
a model with per-processor failure, what we consider here is its application for 
full system crashes. 

Durable linearizability [24] weakens strict linearizability by allowing methods 
that were active during a crash to take their effect at any later point in the 
execution (or never), instead of requiring that the effect of such methods is 
visible immediately after the crash (or never). This weakening aims to allow lazy 
recovery for large structures, where either the recovery procedure is executed in 
parallel to other methods after a crash, or the methods themselves participate 
in recovering the data structure when they are further executed. This notion
is also expressible as an abstract implementation in our language. For this
matter, every update method in the specification would: first record its task 
in a work-set; remove the task from the work-set; flush the updated work-set; 
and perform the task like in L% ; described above. In turn, every query method 
may choose to complete any task it finds in the work-set, since the method 
performing such a task has crashed during its invocation. For persistent pairs 
(see §7), this is illustrated by the specification below. The non-volatile variable 
w is the multiset holding the work-set with atomic add and remove operations, 
and l_w is an abstract multiple-readers-single-writer lock used to resolve races
on the work-set. 


write:
    LOCK1: acquire l_w as a reader;
    add (a1, a2) to w;
    remove (a1, a2) from w;
    fl(w);
    UNLOCK1: release l_w;
    ... continue as in write of L♯_pair (§7) ...

read:
    goto {LOCK1, BEGIN};
    LOCK1: acquire l_w as a writer;
    pick some (a1, a2) ∈ w;
    remove (a1, a2) from w;
    fl(w);
    ... write (a1, a2) to (x, y) as in write of L♯_pair (§7) ...
    UNLOCK1: release l_w;
    BEGIN: ... continue as in read of L♯_pair (§7) ...

recover:
    return;


A “buffered” version of strict linearizability, which only requires the
existence of a prefix of the completed invocations to be observed after a crash, is
also naturally derived by considering L♯_{S,b}, which is obtained from a sequential
implementation S by wrapping each method of S inside a global lock and a
persistence block (without an explicit flush instruction) and ensuring that there is a
single non-volatile variable that is written to by all library methods (introducing
such a variable if needed).⁶

An alternative operational characterization of durable linearizability using 
Input/Output automata was developed in [12] and used to formally establish 
this property for the persistent queue of [14] by providing a full-blown simulation 
proof using the KIV proof assistant.⁷ Nevertheless, this work does not relate the
proved correctness criterion to contextual refinement. 


Persistency models. The underlying model we assume is PSC by [25], a 
strengthening of Px86 [30] that formalizes the Intel-x86 persistency. The pa- 
per [25] provided compiler mappings that ensure PSC semantics on machines 
guaranteeing Px86 semantics. We extended the general semantic framework with 
libraries, and extended PSC with local store fences and persistence blocks. 


Future work. Future work includes extending our proof method and results 
for weaker persistency models, such as persistent x86-TSO [30] and ARM [10]; 
handling random access shared memory with allocations and deallocations (in- 
stead of the simplified shared variables model we employ); and lifting the strict 
condition that libraries and clients live in disjoint address spaces by allowing 
them to transfer ownership of certain locations (as was done in [17] for standard 
volatile memory). 

In addition, extending and adapting methods for refinement verification un- 
der volatile memory is needed in order to provide library developers with means 
to validate our library-correctness conditions. Such methods may include au- 
tomated checking by approximation [7], layered interactive verification in the 
style of [20,27], and formal logics such as the one in [26]. Similarly, developing formal
methods and tools that allow using library specifications for client reasoning is 
left for future work, including decidable reachability analysis [2], program log- 
ics [29], and principled testing [15]. Finally, it is interesting to see how logical 
atomicity notions established by program logics, such as [11,31], which have been
extended to cover crashes in disk-based storage systems [9], can be adapted for 
establishing our correctness condition and/or for client reasoning. 


6 Since the corresponding “buffered” correctness notion is not compositional, while the
refinement-based notion is (see Corollary 1), one cannot expect to have a per-object
translation of a sequential implementation S into a concurrent and persistent
implementation L♯_{S,b}. Indeed, the addition of a single non-volatile variable that is
written to by all library methods is not a per-object translation (i.e., for two
sequential library implementations implementing disjoint sets of methods and operating
on disjoint variables, S1 and S2, we will not have L♯_{S1⊎S2,b} = L♯_{S1,b} ⊎ L♯_{S2,b}).

7 See https://kiv.isse.de/projects/Durable-Queue.html.

Abstract. We consider the problem of statically detecting data races in
periodic real-time programs that use locks, and run on a single processor 
platform. We propose a technique based on a small set of rules that 
exploits the priority, periodicity, locking, and timing information of tasks 
in the program. One of the key requirements is a response time analysis 
for such programs, and we propose an algorithm to compute this for 
the case of non-nested locks. We have implemented our analysis for real- 
time programs written in C in a tool called PEPRACER and evaluated 
its performance on a small set of benchmarks from the literature. 


Keywords: Real-Time systems · periodic programs · static analysis ·
data races · WCRT Analysis


1 Introduction 


Periodic real-time applications (or simply periodic programs) are a class of real- 
time systems that comprise a set of tasks, each of which comes with an associated 
priority and periodicity, and are executed according to a scheduling policy like 
priority-based preemptive scheduling, on a real-time operating system. Thus 
each task is made ready to run at the beginning of its period (though it may 
actually get to execute only later depending on its priority and how long it has 
been waiting in the ready queue), and may be preempted during its execution by 
higher priority tasks that have been made ready to run. Many of these systems 
are safety-critical in nature, being widely employed in avionics, robotics, and 
autonomous systems. 

These systems are also essentially concurrent in nature (even if we consider 
single processor platforms), since a running task may be preempted by a higher 
priority task, causing them to interleave in time. With concurrency come the 
attendant problems of data-races: it is not difficult to imagine a scenario where 
a low priority task is updating a shared data-structure or even a multi-word 
variable like a long int, when it is preempted by a higher priority task that
goes on to access the potentially inconsistent shared data. Thus it is common for
real-time application developers to use synchronization mechanisms like locks to 
protect accesses to shared data structures (like the ones used to control wheel 
movement in a robot) or resources (like an LCD display). Real-Time operating 
systems typically provide a variety of lock mechanisms from standard locks or 
semaphores to priority-inheritance based locks [18]. 


Our focus in this paper is on giving a way to statically (that is, by analyzing
the source code of the application rather than running it) detect races in periodic
programs that use standard locks. The emphasis in static analysis techniques
is on soundness: we do not eliminate a pair of conflicting accesses unless we 
can prove that they do not race. The other side of the coin is precision: how 
close is the set of potential races reported to the actual set of races in the 
program. The basic technique used in the programming languages community 
to statically detect races is a lockset analysis, which computes the set of locks that 
are must-held at each statement in a task, and declares two statements to be non- 
racy if they hold a common lock. More recent techniques [17,20] exploit priority 
information to declare accesses to be non-racy: for instance a high-priority task 
does not need to protect its accesses from a lower priority task. 
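
The lockset rule described above can be sketched in a few lines. This is a minimal illustration rather than the analysis implemented in our tool: the Access record, the notion of "conflicting" used here (same variable, different tasks, at least one write), and the sample accesses in the usage note are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    task: str
    var: str
    is_write: bool
    lockset: frozenset = frozenset()  # locks that are must-held at this statement


def conflicting(a, b):
    """Assumed definition: same variable, different tasks, at least one write."""
    return a.var == b.var and a.task != b.task and (a.is_write or b.is_write)


def may_race(a, b):
    """Lockset rule: a conflicting pair is non-racy if it shares a must-held lock."""
    return conflicting(a, b) and not (a.lockset & b.lockset)
```

For example, two writes to a display that both hold the same lock are declared non-racy, while an unprotected read and write of a shared flag in different tasks remain a potential race under this rule alone.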


However, none of these techniques seek to exploit the inherent periodic na- 
ture or execution times of the tasks in these programs. For example, a simple 
observation is that if two tasks have the same period and don’t take any locks, 
they can never overlap in time. Exploiting timing information is also key to 
improving the precision of a race analysis technique for these programs. The 
notion of worst-case response time (WCRT) of a task measures the maximum 
time an instance of the task may take to complete its execution starting from the 
beginning of its period. As an example of how we can use conservative WCRT 
estimates, if we can conclude from the WCRT information that a low-priority 
task always finishes execution before the next arrival of a high-priority task, we 
can declare them to be non-racy. 
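
As a rough illustration of this observation, the check below decides whether a low-priority task is guaranteed to finish before the next release of a high-priority task. It is a simplified sketch and not one of the rules of Sec. 6: it assumes that both tasks are first released at time 0, that the period of the low-priority task is a multiple of the period of the high-priority task (so every release of the former coincides with a release of the latter), and that the WCRT estimate is measured from the release of the task. The function name and parameters are hypothetical.

```python
def finishes_before_next_release(wcrt_low, period_low, period_high):
    """Sufficient check under the stated assumptions; False means 'cannot conclude'."""
    if period_low % period_high != 0:
        return False  # releases are not aligned, so this sketch gives up
    # The low-priority task is released together with the high-priority task and
    # must complete no later than the following release of the high-priority task.
    return wcrt_low <= period_high
```

With periods of 30 and 10 time units, a conservative WCRT estimate of 8 for the low-priority task passes the check, while an estimate of 12 does not.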


While computing the WCRT of tasks in periodic programs is well-studied 
in the real-time systems community, starting from [13,12] for periodic programs 
without locks, and for periodic programs with priority-inheritance-based locks 
[18], as far as we are aware there are no techniques available for periodic programs 
with standard locks. One of the contributions of this paper is to extend the 
classical technique of [12] to compute WCRT estimates for programs with non- 
nested locks, given worst-case execution time (WCET) estimates of tasks and 
lock-unlock blocks (or critical sections). 
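
For reference, the classical response-time recurrence for fixed-priority preemptive periodic tasks without locks, R = C + Σ_j ⌈R / T_j⌉ · C_j over the higher-priority tasks j, can be computed by fixed-point iteration. The sketch below shows only this lock-free baseline, which we assume is the technique the text refers to; it is not the extension to non-nested locks developed in this paper. The function name, the limit parameter, and the integer time units are assumptions.

```python
def classical_wcrt(wcet, higher_priority, limit):
    """Least fixed point of R = C + sum(ceil(R / T_j) * C_j), or None above limit.

    wcet: execution time C of the task under analysis (positive integer).
    higher_priority: list of (C_j, T_j) pairs for all higher-priority tasks.
    limit: bound (for example the period) beyond which the iteration stops.
    """
    r = wcet
    while r <= limit:
        # -(-r // t) is integer ceiling division: releases of task j within r.
        interference = sum(-(-r // t_j) * c_j for c_j, t_j in higher_priority)
        nxt = wcet + interference
        if nxt == r:
            return r
        r = nxt
    return None
```

The iteration is monotone, so it either reaches the least fixed point or exceeds the limit, in which case no bound within the limit exists.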


We then go on to give a set of six rules (in the spirit of the ideas described 
above) to soundly eliminate pairs of conflicting accesses, leading to a sound, 
efficient, and fairly precise race-detection technique for such programs. 


We have implemented our analysis in a tool called PEPRACER for detecting 
races in such programs written in C. One of the inputs to the tool is a WCET 
analysis for different blocks in the program tasks, which we obtain using the 
WCET analysis tool Heptane [11]. We have run our tool on several benchmarks, 
including robot controllers from the nxtOSEK project [2]. Our tool runs in a
fraction of a second on these benchmarks, and on average eliminates 97% of
conflicting access pairs as non-racy.

An overview of our technique is presented in the next section on an exam- 
ple adapted from one of our benchmarks. Periodic programs and their execution 
semantics are introduced in Sec. 3. Sec. 4 formally defines the notions of conflict- 
ing accesses and data races. Algorithms for computing safe bounds on response 
times of periodic programs with locks are presented in Sec. 5.2. Sec. 6 gives 
the rules for disjointedness of tasks and the race detection algorithm for peri- 
odic programs. Our experiments on benchmark examples are detailed in Sec. 7, 
followed by a discussion on related work in Sec. 8. 


2 Overview 


We provide an overview of our technique with an illustrative example adapted 
from the “lego_osek” robot controller, based on the OSEK operating system, 
from [2]. Fig. 1 shows some excerpts from this example. The controller’s job is 
to control the motion of the two-wheeled robot to follow a line (that it detects
using light sensors); it also detects obstacles along the way (using a sonar sensor)
and avoids them by braking and moving to the left. The controller has two tasks
TaskControl and TaskObstAvoid that do the line-following control and obsta- 
cle detection and avoidance respectively. TaskControl has high priority (higher 
value indicates higher priority) and runs every 10ms, while TaskObstAvoid has 
low priority and runs every 30ms. The two tasks access some shared locations, 
including structures for actuating the left and right wheel motors, an LCD dis- 
play, and a boolean “obstacle-detected” flag. TaskControl reads two light sensor 
values, does some computation with them, and writes them to the LCD dis- 
play. The access to the LCD display is protected by acquiring and releasing the 
lcd_lock lock. Finally, it computes the new speed and brake values that are
then written to the wheel motor structures, after checking that the obstacle 
flag is not set. The TaskObstAvoid task reads the sonar and left light sensors, 
does some computation on them, sets the obstacle flag based on these values, 
and displays them on the LCD (making sure to take a lock on it first). If the 
obstacle flag was set, it goes on to write to the left wheel structure to brake 
and turn the robot to the left. 

We note that there are several conflicting accesses to the shared variables, 
including lines 13 and 33 to lcd, lines 16 and 29 and 16 and 31 on obstacle,
and lines 19-20 and 36-37 on left_wheel. Apart from the accesses to lcd which
are protected by a lock, the other accesses appear to be racy at first glance. For 
instance, while TaskObstAvoid is updating the left wheel structure, it could be 
preempted by the higher priority TaskControl which goes on to write to the 
same structure, potentially leading to a harmful race. 

Our key idea is to exploit the priority, periodicity, and worst case response
times of the tasks, to show that these accesses cannot race. Fig. 2 shows the
periodic execution of the two tasks. Notice that if the low priority task is
guaranteed to finish its execution before the next instance of the higher priority
task is scheduled, there can be no interleaving of the two tasks, and we can declare
all the conflicting accesses as non-racy. However, concluding this in the presence
of locks is not easy, and our first contribution is a way of computing an estimate
of the worst case response times for tasks that take non-nested locks (like in
the example program). Using raw WCET estimates of the tasks and their lock blocks
(like lines 11-14) for the platform the robot controller is to be run on, we use
Algo. 2 (described in Sec. 5) to compute an estimate of the response time of
TaskObstAvoid. Rule 3 (described in Sec. 6) then allows us to eliminate all the
pairs of conflicting accesses as non-racy.


 1. // Shared structures and variables
 2. struct motor right_wheel;
 3. struct motor left_wheel;
 4. struct display lcd;
 5. bool obstacle = 0;

 6. void TaskControl() { // Per 10, Prio 2 (high)
 7.   int sensor_right, sensor_left;
 8.   // Read and calibrate sensor values
 9.   sensor_right = get_light_sensor(right);
10.   sensor_left = get_light_sensor(left);
11.   lock(lcd_lock);
12.   // display sensor values on LCD
13.   show_var(sensor_right, sensor_left);
14.   unlock(lcd_lock);
15.   // Motor control, uses sensor values
16.   if (!obstacle) {
17.     right_wheel.speed = ...;
18.     right_wheel.brake = 0;
19.     left_wheel.speed = ...;
20.     left_wheel.brake = 0;
21.   }
22. }

23. void TaskObstAvoid() { // Per 30, Prio 1 (low)
24.   int sonar_value, sensor_left;
25.   // Read and calibrate sensor values
26.   sonar_value = get_sonar_sensor();
27.   sensor_left = get_light_sensor(left);
28.   if (...)
29.     obstacle = 1;
30.   else
31.     obstacle = 0;
32.   lock(lcd_lock);
33.   show_var(sonar_value, sensor_left);
34.   unlock(lcd_lock);
35.   if (obstacle) { // avoid by moving left
36.     left_wheel.speed = ...;
37.     left_wheel.brake = 1;
38.   }
39. }


Fig. 1: An example periodic program adapted from Lego-OSEK


We note that techniques such as [17,20] that consider task priorities and locks
(but not periodicities and response times) would not be able to eliminate any of
the conflicting access pairs, except the accesses to lcd which are protected by a
lock.


[Timeline diagram; annotation: "WCRT est. of task L"]


Fig. 2: Task timelines for Lego-OSEK example


3 Periodic Programs


A periodic program is a collection of tasks. Each task has an associated function, 
period, and priority. There is a designated init task which is the only task that 
is ready to run initially. An execution of the program begins with running the 
function associated with the init task, which initializes shared variables. It then 
makes other tasks ready to run using the start command. The init task runs 
only once. 

The execution of the tasks is orchestrated by a priority-based preemptive 
scheduler. It is important to point out here that we are assuming a single pro- 
cessor platform. The scheduler selects one of the enabled tasks for execution 
on a highest-priority-first basis. A task with period T is enabled every T time 
units. If more than one task of the highest priority is ready to run, the
longest waiting task is picked for execution. This is also known as
First-Come-First-Served (FCFS) scheduling.
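
The selection policy can be sketched as follows. The ReadyTask record and the pick_next function are hypothetical names used only for illustration; the sketch assumes that each ready task records the time at which it became ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadyTask:
    name: str
    priority: int     # a higher value means a higher priority
    ready_since: int  # time at which the task became ready to run


def pick_next(ready):
    """Highest priority first; among equals, the longest-waiting task (FCFS)."""
    return min(ready, key=lambda t: (-t.priority, t.ready_since))
```

Sorting by the pair (negated priority, ready time) encodes both criteria in a single key, so the tie-break applies only among tasks of the highest priority.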

The task functions operate on a set of shared variables V using assignment
statements, and accesses to the shared variables can be synchronized using the
lock-unlock commands. The set of commands Cmd_V (over a set of variables V)
that can be used in a periodic program is shown in Table 1.


Table 1: Periodic Program Commands Cmd_V


Statement    Description

start        Make all tasks ready for execution.
begin        Begins execution of the task.
end          Ends execution of the task.
skip         Do nothing.
x := e       Assign the value of expression e to x.
assume(b)    Enabled only if expression b evaluates to true; does nothing.
lock(l)      Take lock l if available; otherwise block till l becomes available.
unlock(l)    Release lock l.


Formally, a periodic program P is a tuple (V, L, T) where V is a finite set of
shared variables, L is a finite set of locks, and T is a finite set of tasks, including
a designated init task. A task τ ∈ T is a tuple (G_τ, T_τ, p_τ), where G_τ is the task
function, T_τ is the period of the task, and p_τ is its priority. The task function
G_τ is represented as a Control Flow Graph (CFG) G_τ = (Loc_τ, I_τ, ent_τ, ext_τ),
where Loc_τ is the finite set of locations of τ, I_τ ⊆ Loc_τ × Cmd_V × Loc_τ is the
set of instructions of τ, and ent_τ, ext_τ ∈ Loc_τ are the entry and exit locations
respectively of τ. We denote the set of locations and instructions in P by Loc_P =
⋃_{τ∈T} Loc_τ and I_P = ⋃_{τ∈T} I_τ respectively, assuming the sets of locations to be
disjoint across tasks. We will drop the subscripts whenever they are clear from
the context.

An example periodic program and the CFG representation of one of its tasks
ObsDect are shown in Fig. 3. The periodic program has two tasks that implement
a simple robotic controller, apart from the default init task. The ObsDect
task function detects an obstacle based on the sensor input in the sIn variable
and takes a corrective action. The MoveForward task function directs the robot
to move forward if there is no obstacle. The ObsDect task has high priority (value 
2) and runs every 100 time units, while the MoveForward task has lower priority 
(value 1) and runs only every 200 time units. Both the tasks access the shared 
variables obstacle and forward. 


(a) An example program

init:
1. obstacle := 0;
2. forward := 0;
3. sIn := 0;
4. start;

// Period = 100, Prio = 2
ObsDect:
10. obstacle := 0;
11. if (sIn <= 10) {
12.     obstacle := 1;
13.     forward := -100;
14. }
15.

// Period = 200, Prio = 1
MoveForward:
20. if (!obstacle)
21.     forward := 100;
22. skip

(b) CFG of the ObsDect task

[CFG figure. Locations 10, 11, 12, 13, 15 with edges:
10 --obstacle:=0--> 11, 11 --assume(sIn<=10)--> 12, 11 --assume(sIn>10)--> 15,
12 --obstacle:=1--> 13, 13 --forward:=-100--> 15.]


Fig. 3: Example program and the CFG representation


We now define the semantics of a periodic program $P = (V, L, T)$ as a labeled
transition system $S_P = (S, s_{in}, \Rightarrow)$ where $S$ is the set of states, $s_{in} \in S$ is the
initial state, and $\Rightarrow$ is the transition relation, as defined below. In the following,
$Q_T$ denotes the set of possible task priority queues and $\epsilon$ denotes an empty queue.
We also assume that the tasks have distinct priorities in $\mathcal{P} = \{1, \ldots, k\}$ with
a higher value indicating higher priority. For an integer expression $e$, boolean
expression $b$, and an environment $\phi$ for $V$, we denote by $[\![e]\!]_\phi$ the integer value
that $e$ evaluates to in $\phi$, and by $[\![b]\!]_\phi$ the boolean value that $b$ evaluates to
in $\phi$. For a function $f : X \to Y$, and elements $x \in X$ and $y \in Y$, we use the
notation $f[x \mapsto y]$ to denote the function $f' : X \to Y$ given by $f'(x) = y$, and
$f'(z) = f(z)$ for all $z$ different from $x$.

A state $s \in S$ is a tuple $(R, W, A, B, pc, \phi, tick, r)$ where


— $R$ is a priority queue of tasks that are ready to run,

— $W \subseteq T$ is the set of tasks that are waiting to be scheduled,

— $A : L \rightharpoonup T$ is a partial map that gives, for each lock, the task that has
acquired the lock,

— $B : L \to Q_T$ is a map that gives, for each lock, the priority queue of tasks
that are blocked on the lock,

— $pc : T \to Loc_P$ is a map giving the current location of each task,

— $\phi : V \to \mathbb{Z}$ is a variable to value map,

— $tick \in \mathbb{N}$ is the time units elapsed since the program started, and

— $r \in T$ is the currently running task.


The initial state $s_{in}$ is defined to be $(\epsilon, T \setminus \{init\}, \emptyset, \emptyset, \lambda\tau.ent_\tau, \lambda x.0, 0, init)$,
denoting the fact that initially the init task is the running task while no other
tasks are ready to run and instead are waiting to be scheduled, none of the tasks
have acquired locks and hence they are not blocked, all the tasks are at their
entry locations, all the variables are initialized to zero, and so is the tick counter.

We now define the transition relation $\Rightarrow \subseteq S \times I_P \times S$ as follows. For a state
$s = (R, W, A, B, pc, \phi, tick, r)$, a task $\tau$, and an instruction $\iota = (l, c, l')$ in $G_\tau$,
we have $s \Rightarrow_\iota s'$ iff one of the rules in Fig. 4 is satisfied. If, for a command $c$, the
conditions on state $s$ specified in the antecedent (the ones mentioned above the
line) hold, then $s \Rightarrow_\iota s'$ in the consequent (the one below the line) also holds.

In the START rule, for the start command executed by the init task, all the 
tasks in W that are waiting to be scheduled onto the ready queue are enqueued 
onto R. We now pick the highest priority task, which is at the head of the 
updated ready queue, to be the next running task. Once the init task executes 
the start command, it plays no role in the rest of the execution. 

The rule uses the ENQ(Q, S) function which when given a priority queue Q 
of tasks and a set S of tasks, enqueues each task in S onto the queue Q. The 
function enq(Q, s) is the standard enqueue function for a priority queue Q. The 
function deq(Q) returns the queue with the head element removed. The function 
head(Q) when given a priority queue Q of tasks returns the task with the highest 
priority, which is at the head of Q. 
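
The queue helpers used by the rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a queue is modelled as a tuple sorted by decreasing priority, the empty tuple plays the role of the empty queue, and the extra `prio` argument (a task-to-priority map) is a convenience of this sketch that the paper's notation leaves implicit. The task names and priorities are those of Fig. 3.

```python
from typing import Dict, Iterable, Tuple

Queue = Tuple[str, ...]  # tasks in decreasing priority order; () is the empty queue


def enq(q: Queue, task: str, prio: Dict[str, int]) -> Queue:
    """Standard priority-queue insert: keep the queue sorted by decreasing priority."""
    return tuple(sorted(q + (task,), key=lambda t: -prio[t]))


def ENQ(q: Queue, tasks: Iterable[str], prio: Dict[str, int]) -> Queue:
    """Enqueue every task of a set onto q."""
    for t in sorted(tasks):  # sorted only to make the sketch deterministic
        q = enq(q, t, prio)
    return q


def head(q: Queue) -> str:
    """Highest priority task, which sits at the head of the queue."""
    return q[0]


def deq(q: Queue) -> Queue:
    """The queue with its head element removed."""
    return q[1:]


# START rule on the program of Fig. 3: all waiting tasks become ready,
# the head of the updated queue becomes the running task.
prio = {"ObsDect": 2, "MoveForward": 1}
waiting = {"ObsDect", "MoveForward"}
updated = ENQ((), waiting, prio)
running, ready = head(updated), deq(updated)
print(running, ready)
```

With these definitions the START rule's successor state has ready queue `deq(ENQ(R, W))` and running task `head(ENQ(R, W))`, matching the description above.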

The END rule is defined for the end command to signal completion of the 
currently running task. Hence the task is inserted into the wait list W. Moreover, 
the highest priority task in the ready queue R, which is at its head, is removed 
from R and made the running task. The rule requires that the ready queue R 
be non-empty. 

The ALOCK rule is defined for the lock(m) command. If the running task $r$
requests a lock $m$ which is not acquired by any task (as given by $A(m) =$
undef) then the running task proceeds with acquiring the lock. The BLOCK rule
is defined for the lock(m) command when the running task cannot acquire the
lock. If the running task $r$ requests a lock $m$ which is acquired by a task $\tau'$
(as given by $A(m) = \tau'$) then the running task $r$ is blocked by enqueuing it
onto the blocked queue $B(m)$. This calls for a re-schedule and hence the highest
priority task from the non-empty ready queue $R$ is made the running task.

$$\frac{c = \text{skip} \quad pc(r) = l \quad \tau = r}{s \Rightarrow_\iota (R, W, A, B, pc[r \mapsto l'], \phi, tick, r)}\ \text{SKIP}$$

$$\frac{c = (x := e) \quad pc(r) = l \quad \tau = r}{s \Rightarrow_\iota (R, W, A, B, pc[r \mapsto l'], \phi[x \mapsto [\![e]\!]_\phi], tick, r)}\ \text{ASSIGN}$$

$$\frac{c = \text{begin} \quad pc(r) = l \quad \tau = r}{s \Rightarrow_\iota (R, W, A, B, pc[r \mapsto l'], \phi, tick, r)}\ \text{BEGIN}$$

$$\frac{c = \text{assume}(b) \quad pc(r) = l \quad \tau = r \quad [\![b]\!]_\phi = true}{s \Rightarrow_\iota (R, W, A, B, pc[r \mapsto l'], \phi, tick, r)}\ \text{ASSUME}$$

$$\frac{c = \text{start} \quad pc(r) = l \quad \tau = r = init}{s \Rightarrow_\iota (deq(ENQ(R, W)), \emptyset, A, B, pc[r \mapsto l'], \phi, tick, head(ENQ(R, W)))}\ \text{START}$$

$$\frac{c = \text{end} \quad pc(r) = l \quad \tau = r \quad R \neq \epsilon}{s \Rightarrow_\iota (deq(R), W \cup \{r\}, A, B, pc[r \mapsto l'], \phi, tick, head(R))}\ \text{END}$$

$$\frac{c = \text{lock}(m) \quad pc(r) = l \quad \tau = r \quad A(m) = undef}{s \Rightarrow_\iota (R, W, A[m \mapsto r], B, pc[r \mapsto l'], \phi, tick, r)}\ \text{ALOCK}$$

$$\frac{c = \text{lock}(m) \quad pc(r) = l \quad \tau = r \quad A(m) = \tau' \quad R \neq \epsilon}{s \Rightarrow_\iota (deq(R), W, A, B[m \mapsto enq(B(m), r)], pc, \phi, tick, head(R))}\ \text{BLOCK}$$

$$\frac{c = \text{unlock}(m) \quad pc(r) = l \quad \tau = r \quad (A(m) = r \lor A(m) = undef) \quad B(m) = \epsilon}{s \Rightarrow_\iota (R, W, A[m \mapsto undef], B, pc[r \mapsto l'], \phi, tick, r)}\ \text{UNLOCK}$$

$$\frac{c = \text{unlock}(m) \quad pc(r) = l \quad \tau = r \quad A(m) = r \quad Q = B(m) \neq \epsilon \quad head(Q) = \tau' \quad p_{\tau'} < p_r}{s \Rightarrow_\iota (enq(R, \tau'), W, A[m \mapsto undef], B[m \mapsto deq(Q)], pc[r \mapsto l'], \phi, tick, r)}\ \text{UNL-WK}$$

$$\frac{c = \text{unlock}(m) \quad pc(r) = l \quad \tau = r \quad A(m) = r \quad Q = B(m) \neq \epsilon \quad head(Q) = \tau' \quad p_{\tau'} > p_r}{s \Rightarrow_\iota (enq(R, r), W, A[m \mapsto undef], B[m \mapsto deq(Q)], pc[r \mapsto l'], \phi, tick, \tau')}\ \text{UNL-CS}$$

$$\frac{v = inc(tick) \quad S = \{\tau' \in W \mid v \text{ is a multiple of } T_{\tau'}\}}{s \Rightarrow_\iota (deq(ENQ(R, S \cup \{r\})), W \setminus S, A, B, pc, \phi, v, head(ENQ(R, S \cup \{r\})))}\ \text{TICK}$$


Fig. 4: Transition relation capturing the execution semantics of a periodic program


The UNLOCK rule is defined for the unlock(m) command. If the running
task $r$ requests the release of the lock $m$ which it was holding, or if
no task was holding the lock (as given by $A(m) = r \lor A(m) = undef$),
then the running task can proceed with releasing the lock. Further, if there are
no tasks blocked on this lock $m$ (as given by $B(m) = \epsilon$) then the current task
continues to be the running task. The UNL-WK rule is defined for the unlock(m)
command when a low priority task is blocked on the lock. If the running task
requests the release of the lock $m$ which it was holding, and the task $\tau'$ at
the head of the blocked priority queue $B(m)$ has lower priority than the
running task, then $\tau'$ is unblocked by dequeuing it from its blocked
priority queue $B(m)$ and enqueuing it onto the ready queue $R$. Task $r$ continues
to be the running task. The UNL-CS rule is defined for the unlock(m) command
when a high priority task is blocked on lock $m$. If the running task requests
the release of the lock $m$ which it was holding and a high priority task $\tau'$ is
blocked on the lock then $\tau'$ is unblocked by dequeuing it from its blocked queue
$B(m)$. The task $\tau'$, being of higher priority, is selected as the next running task
while the current running task $r$ is enqueued onto the ready queue $R$.

The Tick rule models the handling of a timer interrupt, signalling that a 
unit of time has elapsed. The tick counter is incremented by one, and the tasks 
in W whose periods divide the tick count, are moved to the ready queue R. The 
current running task r is also enqueued onto the ready queue. We now pick the 
highest priority task in the updated ready queue, which is at its head, as the 
next task to run. 
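
As a rough illustration of the tick step just described, the following Python sketch computes the released tasks, the new ready queue, and the next running task. The function name `tick_step` and the list/set encoding of the queue and wait set are assumptions of this sketch, not part of the formal semantics; the periods and priorities are those of Fig. 3.

```python
from typing import Dict, List, Set, Tuple


def tick_step(
    ready: List[str],
    waiting: Set[str],
    running: str,
    tick: int,
    period: Dict[str, int],
    prio: Dict[str, int],
) -> Tuple[List[str], Set[str], int, str]:
    """One timer interrupt: returns (ready', waiting', tick', running')."""
    v = tick + 1
    # Waiting tasks whose period divides the new tick count are released.
    released = {t for t in waiting if v % period[t] == 0}
    # The released tasks and the current running task join the ready queue.
    queue = sorted(ready + sorted(released | {running}), key=lambda t: -prio[t])
    # The head of the updated queue runs next; the rest stays ready.
    return queue[1:], waiting - released, v, queue[0]


period = {"ObsDect": 100, "MoveForward": 200}
prio = {"ObsDect": 2, "MoveForward": 1}

# MoveForward is still running when the tick counter reaches 100.
ready, waiting, tick, running = tick_step([], {"ObsDect"}, "MoveForward", 99, period, prio)
print(ready, waiting, tick, running)
```

In this scenario ObsDect is released at tick 100 and, having the higher priority, preempts MoveForward, which moves back to the ready queue.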

The SKIP, BEGIN, ASSIGN, and ASSUME rules for the skip, begin, assign- 
ment, and assume commands, respectively, are standard. 

An execution of a periodic program $P$ is a finite sequence of transitions
$\rho = \delta_1, \ldots, \delta_n$ ($n \geq 1$), such that there exists a sequence of states $s_0, \ldots, s_n$ of
$S$, with each $\delta_i \in \Rightarrow$ of the form $(s_{i-1}, \iota_i, s_i)$ for some $\iota_i$, and $s_0 = s_{in}$.

The semantics we have defined so far abstracts away the “real-time” aspect of 
a periodic program. We can obtain the real-time semantics of a periodic program 
by considering a concrete execution environment which fixes the execution time 
of each instruction (say in a bounded interval of time), and restricting ourselves 
to executions where the tick interrupt is driven by a real-time clock and is con- 
sistent with the time taken to execute instructions between two ticks. Henceforth 
we fix such an environment and focus on the induced subset of executions of a 
periodic program. 


4 Data Races 


Let $P = (V, L, T)$ be a periodic program. In an execution of $P$, tasks are executed
periodically and hence during the course of execution of $P$ many instances of
a task get executed. Consider two tasks $\tau_1$ and $\tau_2$ in $T$, and two non-empty
paths $\pi$ and $\pi'$ in $G_{\tau_1}$ and $G_{\tau_2}$, respectively. We say $\pi$ and $\pi'$ may happen in
parallel in $P$ if there is an execution $\rho$ of $P$, and instances of $\tau_1$ and $\tau_2$ in $\rho$ which
execute along the paths $\pi$ and $\pi'$ respectively, in such a way that the paths $\pi$
and $\pi'$ interleave (that is, either $\pi'$ begins after $\pi$ has begun but not yet ended,
or vice versa).

We now define when two statements $s_1$ and $s_2$ (corresponding to instructions
$\iota_1 = (l_1, c_1, l_1')$ and $\iota_2 = (l_2, c_2, l_2')$) in tasks $\tau_1$ and $\tau_2$, respectively, may happen
in parallel. Consider the program $P'$ obtained from $P$ by enclosing the statements
$s_1$ and $s_2$ in skip statements. Formally, we obtain $P'$ by replacing the instruction
$\iota_1$ by the instructions $(l_1, \text{skip}, m_1)$, $(m_1, c_1, m_1')$, and $(m_1', \text{skip}, l_1')$, where $m_1$
and $m_1'$ are new locations in $Loc_{\tau_1}$; and similarly for $\iota_2$. Let $\pi_1$ be the path
$l_1 \xrightarrow{\text{skip}} m_1 \xrightarrow{c_1} m_1' \xrightarrow{\text{skip}} l_1'$ in $G_{\tau_1}$, and similarly $\pi_2$ in $G_{\tau_2}$. We now say $s_1$ and $s_2$
may happen in parallel in $P$, if the paths $\pi_1$ and $\pi_2$ may happen in parallel in
the program $P'$.

Two statements are called conflicting if they are read/write accesses to the
same variable, and at least one of them is a write. We say two statements $s_1$ and
$s_2$ in $P$ are involved in a data race (or are simply racy) if they are conflicting
accesses that may happen in parallel. As an example, in the program
of Fig. 3, the accesses to obstacle in lines 10 and 20 are conflicting. Without
any assumptions on the execution time of these two tasks, these two statements
are also racy, since there is an execution of the augmented program in which the
skip-blocks around these two statements interleave.
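
To make the notion of conflicting accesses concrete, here is a small Python sketch that enumerates them for the two non-init tasks of Fig. 3. The `Access` record and the restriction to pairs drawn from different tasks are assumptions of this sketch; the list of accesses is read off the program text of Fig. 3(a).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Access:
    task: str       # task containing the statement
    line: int       # line number as in Fig. 3(a)
    var: str        # shared variable accessed
    is_write: bool  # True for a write, False for a read


def conflicting(a: Access, b: Access) -> bool:
    """Same variable, and at least one of the two accesses is a write."""
    return a.var == b.var and (a.is_write or b.is_write)


accesses = [
    Access("ObsDect", 10, "obstacle", True),
    Access("ObsDect", 11, "sIn", False),
    Access("ObsDect", 12, "obstacle", True),
    Access("ObsDect", 13, "forward", True),
    Access("MoveForward", 20, "obstacle", False),
    Access("MoveForward", 21, "forward", True),
]

# Assumption of this sketch: only pairs from different tasks are candidates.
conflict_pairs = [
    (a.line, b.line)
    for a, b in combinations(accesses, 2)
    if a.task != b.task and conflicting(a, b)
]
print(conflict_pairs)
```

The pair of lines 10 and 20 discussed above is among the results; whether a conflicting pair is actually racy additionally depends on whether the two statements may happen in parallel.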

Finally, we define what it means for a “block” of code to happen in parallel
with another. A block of code in $P$ is specified by a pair $(l, X)$, where for some
task $\tau$ in $P$, $l$ is a location in $Loc_\tau$ and $X \subseteq Loc_\tau$ is a subset of locations reachable
from $l$ in task $\tau$. An initial path in a block $B = (l, X)$ of a task $\tau$ in $P$ is a non-empty
path in $G_\tau$ that begins at $l$ and stays within the set of locations $X$, except
possibly for the last location in the path. We say a statement $s = (m, c, m')$ in
$P$ belongs to block $B = (l, X)$ if $m$ belongs to the set $X$. We say two blocks $B_1$
and $B_2$ of $P$ may happen in parallel if there are two initial paths, $\pi_1$ in $B_1$ and
$\pi_2$ in $B_2$, which may happen in parallel with each other. Otherwise, $B_1$ and $B_2$
are disjoint.


5 Response Time and its Computation 


Our aim in this section is to give a way of computing a safe bound on the 
response time of tasks in a periodic program with locks. We begin by recalling 
some of the basic notions. 

Consider a sequential piece of compiled code B executing on a given hardware 
platform. Assume that the code does not have to compete for the processor time 
with other processes (in particular there is no preemption, and lock statements 
succeed without blocking). The execution time of B may still vary depending 
on reads of input and other shared locations, which are assumed to return non- 
deterministic values during the execution. If we consider the supremum of these 
execution times we obtain the worst-case execution time (WCET) of B on the 
given platform. There are many static analysis techniques and tools that help us 
obtain conservative estimates on the WCET of a program on a given platform. 
We refer the reader to [21] for a survey of these techniques and tools. 

Let us now consider a periodic program $P = (V, L, T)$ which we want to
execute in a given execution environment. Let $\tau$ be a task in $T$. Consider an
execution $\rho$ of $P$ in this environment. There could be many instances of $\tau$
executing in $\rho$. Let us consider one such instance, where at time $t$, $\tau$ moves into the
ready queue with the program counter pointing to its start location. Let $t'$ be
the time at which this instance completes (that is, $\tau$ executes its end instruction).
Then the response time of this instance of $\tau$ is $t' - t$. We are interested in the
worst case response time (WCRT) of $\tau$, which is defined to be the supremum of
the response time of instances of $\tau$ over all instances of $\tau$ and all executions of
$P$ in the given environment.


In a similar way we can define the WCRT of a block of code $B$ in $\tau$, where
we take the initial time $t$ to be the time the instance of $\tau$ is in the ready queue with
the program counter pointing to the beginning of $B$, and $t'$ to be the time the
last instruction of $B$ completes.


We note that the response time of a task (or a block of code) may exceed its
WCET, as the task may lose processor time due to preemption by higher priority
tasks, or due to blocking lock attempts. To illustrate this, consider a periodic
program with three tasks $\tau_1$ (priority 1, period 20), $\tau_2$ (priority 2, period 13), and
$\tau_3$ (priority 3, period 8). Suppose the tasks have a simple structure comprising
straight-line code, and each of them takes and releases a common lock $l$. Let the
WCET for each segment of the tasks be as shown in Fig. 5. Consider a portion of
a possible execution of $P$ shown in Fig. 6. We note that $\tau_2$, which has a WCET
of 3, is ready to run at time 39 but completes execution only at time 44. Thus
its response time in this instance is 5. This was due to the 2 units of processor
time taken away by task $\tau_3$ in its interruption during $\tau_2$'s execution. Notice also
that the top priority task $\tau_3$ is delayed by 1 unit of time waiting for $\tau_2$ to release
the lock it had acquired before it was preempted.


Task       Before lock(l)    Lock block (lock(l) to unlock(l))    After unlock(l)

$\tau_3$   1                 $B^3$: 0.5                           0.5
$\tau_2$   0.5               $B^2$: 1.5                           1
$\tau_1$   1                 $B^1$: 1                             1


Fig. 5: Block WCETs of tasks of example program


We say a periodic program P is schedulable if the WCRT of each task is 
less than or equal to its period. However, since it is difficult to know the exact 
WCRT, we will look for a conservative WCRT estimate which is less than or 
equal to the period of the task, to declare that a program is schedulable. 
Fig. 6: Illustrating response time


5.1 Computing Response time without Locks 


In the classical setting of periodic programs without locks, a conservative estimate
of the WCRT for each task can be computed using Eq (1) below [12,13]. Let
$P = (V, L, T)$ be a periodic program. We assume for convenience in the rest of
this section that $P$ has tasks $\tau_1, \ldots, \tau_n$ with distinct priorities (we ignore the
init task). Without loss of generality we assume $\tau_i$ has priority $i$. Further, each
task $\tau_i$ has period $T_i$ and a WCET estimate $C_i$. Consider the equation below from [12], which
in turn is based on the analysis in [13]. Here the $R_i$'s are variables representing
the WCRT of the tasks $\tau_i$ respectively.


$$R_i = C_i + \sum_{j > i} \left( \lceil R_i / T_j \rceil \cdot C_j \right) \qquad (1)$$


Theorem 1 ([12,13]). The least solution to Eq (1), whenever it exists, is an
upper bound on the WCRT of task $\tau_i$.


Proof. Let $L$ be any solution to Eq (1). We argue that $L$ must upper bound the
response time of any instance of task $\tau_i$. Consider an instance of task $\tau_i$ that
is enabled (enters the ready queue) at time $t$. Consider the time point $t + L$.
If we ask ourselves how much processor time can be taken away in the interval
$[t, t + L]$ by a higher priority task $\tau_j$, it is clearly bounded by $\lceil L/T_j \rceil \cdot C_j$. Thus,
the total time that can be taken away by all higher priority tasks put together
is bounded by $\sum_{j > i} (\lceil L/T_j \rceil \cdot C_j)$. This leaves at least $C_i$ time for task $\tau_i$ to
execute, and hence it must complete execution by $t + L$.


Algo. 1 below, which is similar to the recursive procedure proposed in [12],
computes the least solutions to Eq (1) to obtain conservative estimates of the
WCRT of tasks, and thereby tells whether a periodic program is schedulable or
not.


Algorithm 1: Check Schedulability (No Locks)

Data: Periodic program $P$ without locks, WCET estimates $C_i$ for $\tau_i$
Result: $P$ schedulable or not, and if so WCRT estimate for each task
foreach task $\tau_i$ do
    $L_i^{p}$ := 0;
    $L_i$ := $C_i$;
    while ($L_i$ is not a solution to Eq (1) and $L_i \leq T_i$) do
        tmp := $L_i$;
        $L_i$ := $L_i + \sum_{j > i} ((\lceil L_i/T_j \rceil - \lceil L_i^{p}/T_j \rceil) \cdot C_j)$;
        $L_i^{p}$ := tmp;
    end
    if ($L_i$ does not satisfy Eq (1) or $L_i > T_i$) then
        return “Unschedulable”;
    end
end
return “Schedulable”, $L_1, \ldots, L_n$;


5.2 Computing Response Time with Locks


Thm. 1 no longer holds (and Algo. 1 is no longer sound) when tasks are allowed
to take locks. This can be seen from the example program and sample execution
in Figs. 5 and 6, where for instance task $\tau_3$ has a response time of 3, but the
least solution to the corresponding Eq (1) is 2. However, as we show below, it is
possible to extend the classical approach to handle non-nested locks.
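
The lock-oblivious iteration of Algo. 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `check_schedulable_no_locks` and the dictionary encoding (task index mapped to WCET and period, with a larger index meaning higher priority) are assumptions. The numbers are the task WCETs and periods of the example of Fig. 5, with the locks deliberately ignored.

```python
from math import ceil
from typing import Dict, Optional


def check_schedulable_no_locks(C: Dict[int, float], T: Dict[int, float]) -> Optional[Dict[int, float]]:
    """Incremental iteration of Algo. 1; returns WCRT estimates or None if unschedulable."""
    L = {}
    for i in sorted(C):
        higher = [j for j in C if j > i]  # tasks with higher priority than i

        def rhs(x: float) -> float:
            # Right-hand side of Eq (1) for task i.
            return C[i] + sum(ceil(x / T[j]) * C[j] for j in higher)

        prev, cur = 0, C[i]
        while cur != rhs(cur) and cur <= T[i]:
            tmp = cur
            # Add only the interference that is new since the previous iterate.
            cur = cur + sum((ceil(cur / T[j]) - ceil(prev / T[j])) * C[j] for j in higher)
            prev = tmp
        if cur != rhs(cur) or cur > T[i]:
            return None
        L[i] = cur
    return L


C = {1: 3, 2: 3, 3: 2}    # task WCETs (sums of the segment WCETs of Fig. 5)
T = {1: 20, 2: 13, 3: 8}  # task periods
print(check_schedulable_no_locks(C, T))
```

For $\tau_3$ the iteration stops immediately at its WCET of 2, which is below the response time of 3 observed in the sample execution; this is the unsoundness in the presence of locks noted above.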


Before we consider the general case, it will be instructive to first consider the
example program of Fig. 5. Let $C_1, C_2, C_3$ stand for the WCET estimates for
tasks $\tau_1, \tau_2, \tau_3$ respectively, and $C^1, C^2, C^3$ for the WCET estimates of the blocks
$B^1, B^2, B^3$ respectively. Let us first begin by asking what is the response time
of the block $B^1$. Recall that this is the portion of code between the lock($l$)-
unlock($l$) statements in $\tau_1$. Since $B^1$ does not contain any lock statements, the
response time for this follows Eq (1), and we can write Eq (6) to capture its
response time, $U^1$. In a similar way the response time, $U^2$, of the block $B^2$ is
given by Eq (5). It is easy to see that the response time, $U^3$, of the block $B^3$ in
the highest priority task $\tau_3$ is simply $C^3$.


Next, we consider the top priority task $\tau_3$. The only extra time it may spend
is in waiting for its lock($l$) instruction to succeed. This may happen because one
of the lower priority tasks has acquired lock $l$ and is yet to release it. Suppose
this task is $\tau_2$. Then $\tau_2$ must be somewhere in block $B^2$. But how long can it be
before $\tau_2$ releases $l$? This is at most the response time for $B^2$. In a similar way,
if $\tau_1$ has taken the lock, $\tau_3$ may end up waiting for at most the response time of
$B^1$. Note also that $\tau_3$ may have to wait for at most one of $\tau_2$ or $\tau_1$ to complete
its lock block, never both. Thus, its response time is given by Eq (2).


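Now let us consider task $\tau_2$. It may be delayed either (a) waiting for its
lock($l$) statement to succeed because $\tau_1$ has taken the lock $l$; or (b) because $\tau_3$
takes away some time by preempting it. The former is bounded by the response
time of $B^1$, while the latter is bounded by the number of times $\tau_3$ can interrupt
it times the WCET of $\tau_3$. Thus the response time of $\tau_2$ is captured by Eq (3).
Similarly, $\tau_1$ has no lower priority task to wait for and can only be preempted
by $\tau_2$ and $\tau_3$, which gives Eq (4).


$$R_3 = C_3 + \max(U^2, U^1) \qquad (2)$$

$$R_2 = C_2 + U^1 + \lceil R_2 / T_3 \rceil \cdot C_3 \qquad (3)$$

$$R_1 = C_1 + \lceil R_1 / T_3 \rceil \cdot C_3 + \lceil R_1 / T_2 \rceil \cdot C_2 \qquad (4)$$

$$U^2 = C^2 + \lceil U^2 / T_3 \rceil \cdot C_3 \qquad (5)$$

$$U^1 = C^1 + \lceil U^1 / T_3 \rceil \cdot C_3 + \lceil U^1 / T_2 \rceil \cdot C_2 \qquad (6)$$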

To find the least solution to Eqs (2-6), we can apply the analogue of Algo. 1
to first compute $U^2 = 3.5$ and $U^1 = 6$ using Eqs (5-6). We can now use these
values to compute the values $R_1 = 8$, $R_2 = 13$, and $R_3 = 8$. Since these are
within the respective time periods of the tasks, we declare that the program is
schedulable.
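
As a worked instance of this computation, the iterations for $U^2$ and $R_2$ can be written out using the values of Fig. 5 ($C^2 = 1.5$, $C_2 = 3$, $C_3 = 2$, $T_3 = 8$) and $U^1 = 6$. Starting the iteration for $R_2$ at $C_2 + U^1$ is a choice made here for illustration; each line substitutes the previous iterate into the right-hand side of Eq (5) or Eq (3) until the value no longer changes.

```latex
\begin{align*}
U^2 &: \quad 1.5 \;\to\; 1.5 + \lceil 1.5/8 \rceil \cdot 2 = 3.5
        \;\to\; 1.5 + \lceil 3.5/8 \rceil \cdot 2 = 3.5 \\
R_2 &: \quad 3 + 6 = 9 \;\to\; 9 + \lceil 9/8 \rceil \cdot 2 = 13
        \;\to\; 9 + \lceil 13/8 \rceil \cdot 2 = 13
\end{align*}
```

Both fixed points lie within the period of $\tau_2$ ($3.5 \leq 13$ and $13 \leq 13$), consistent with the conclusion that the program is schedulable.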

We can now tackle the general case. Consider a periodic program $P =
(V, L, T)$ satisfying the following assumptions (in addition to distinct priorities):


— $P$ does not use nested locks. In particular, each task $\tau_i$ has a finite number
of lock($l$)-blocks $B^l_{i,1}, \ldots, B^l_{i,n_{l,i}}$, with $n_{l,i} \geq 0$, for each lock variable $l \in L$.
These blocks are pairwise disjoint.

— There is a bound $N^l_i$ on the number of times $\tau_i$ takes lock $l$ in any of its
executions.

— The WCET of each task $\tau_i$ is $C_i$, and of each block $B^l_{i,k}$ is $C^l_{i,k}$.


The equations below capture the WCRT of the tasks and lock blocks of $P$.
The variables here are the $R_i$'s representing the WCRT of task $\tau_i$, and the $U^l_{i,k}$'s
representing the WCRT of blocks $B^l_{i,k}$ respectively.


$$R_i = C_i + \sum_{l \in L} \left( N^l_i \cdot \max_{j < i,\, k} U^l_{j,k} \right) + \sum_{j > i} \left( \lceil R_i / T_j \rceil \cdot C_j \right) \qquad (7)$$

$$U^l_{i,k} = C^l_{i,k} + \sum_{j > i} \left( \lceil U^l_{i,k} / T_j \rceil \cdot C_j \right) \qquad (8)$$


Theorem 2. The least solution to the system of Eqs (7,8), whenever it exists,
is an upper bound on the corresponding WCRT of tasks $\tau_i$ and the blocks $B^l_{i,k}$.


Proof. Once again we show that any solution to the system of equations (7) and
(8) is an upper bound on the WCRT of the tasks and lock blocks of $P$ respectively.
Let $L_1, \ldots, L_n$ and $L^l_{i,k}$ (for $i \in \{1, \ldots, n\}$, $l \in L$, and $k \in \{1, \ldots, n_{l,i}\}$)
be a solution to the equations above. We first argue that the WCRT of a block
$B^l_{i,k}$ is bounded by $L^l_{i,k}$. Since the block is free of lock statements, this is like the
classical case and a similar argument to Thm. 1 applies to conclude that $L^l_{i,k}$ is
an upper bound on the WCRT of $B^l_{i,k}$.

To argue that the WCRT of task $\tau_i$ is bounded by $L_i$, consider an execution
of an instance of task $\tau_i$ where it is made ready at time $t$. Consider the time
interval $t$ to $t + L_i$. We claim that $\tau_i$ must finish its execution before $t + L_i$. Task
$\tau_i$ may lose time because of two reasons. The first is (a): it is blocked on one of its lock($l$)
instructions because some other task $\tau$ has taken the lock $l$. Now it must be the
case that $\tau$ is a lower priority task than $\tau_i$. Suppose $\tau$ had a higher priority than
$\tau_i$. Then either it must have got blocked after acquiring $l$ and before releasing
it, or it was preempted by a still higher priority task $\tau'$. The former case is
ruled out since we don't allow nested locks. We can now apply similar reasoning
to $\tau'$, and so on; but the buck must stop at the highest priority task. Since it
cannot be preempted, it must be blocked waiting to acquire another lock; this is
a contradiction to our no nested lock assumption. Thus, the total time that can
be taken away due to $\tau_i$ waiting for a lock is bounded by $\sum_{l \in L} (N^l_i \cdot \max_{j < i,\, k} L^l_{j,k})$
(corresponding to the second term in Eq (7)). The second reason $\tau_i$ may lose
time is (b): preemption by higher priority tasks. Like before, this is
bounded by $\sum_{j > i} (\lceil L_i/T_j \rceil \cdot C_j)$ (the third term in Eq (7)). Thus, there must
remain at least $C_i$ amount of time in the interval $t$ to $t + L_i$ for $\tau_i$ to execute,
and hence it must complete execution before $t + L_i$.
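
The system of Eqs (7,8) can be solved by a direct fixed-point iteration, sketched below in Python. This is a sketch under assumptions rather than the paper's algorithm verbatim: it recomputes the full right-hand side in each step instead of using an incremental update, and the names `least_fixpoint`, `wcrt_with_locks`, `blocks`, and `N` are hypothetical. Tasks are indexed by priority, `blocks[(i, l)]` lists the WCETs of the lock($l$)-blocks of task $i$, and `N[(i, l)]` is the bound on how often task $i$ takes lock $l$.

```python
from math import ceil
from typing import Dict, List, Optional, Tuple


def least_fixpoint(base: float, i: int, C: Dict[int, float], T: Dict[int, float]) -> Optional[float]:
    """Least R with R = base + sum_{j>i} ceil(R/T[j])*C[j]; None once R exceeds T[i]."""
    R = base
    while R <= T[i]:
        nxt = base + sum(ceil(R / T[j]) * C[j] for j in C if j > i)
        if nxt == R:
            return R
        R = nxt
    return None


def wcrt_with_locks(
    C: Dict[int, float],
    T: Dict[int, float],
    blocks: Dict[Tuple[int, str], List[float]],
    N: Dict[Tuple[int, str], int],
) -> Optional[Dict[int, float]]:
    # Eq (8): response time of every lock block.
    U = {}
    for (i, l), wcets in blocks.items():
        for k, c in enumerate(wcets):
            u = least_fixpoint(c, i, C, T)
            if u is None:
                return None
            U[(i, l, k)] = u
    # Eq (7): response time of every task, with the blocking term folded into the base.
    R = {}
    locks = {l for (_, l) in blocks}
    for i in sorted(C):
        blocking = 0
        for l in locks:
            lower = [u for (j, l2, _), u in U.items() if j < i and l2 == l]
            blocking += N.get((i, l), 0) * max(lower, default=0)
        r = least_fixpoint(C[i] + blocking, i, C, T)
        if r is None:
            return None
        R[i] = r
    return R


# Example of Fig. 5: one common lock "l", one lock block per task, taken once.
C = {1: 3, 2: 3, 3: 2}
T = {1: 20, 2: 13, 3: 8}
blocks = {(1, "l"): [1], (2, "l"): [1.5], (3, "l"): [0.5]}
N = {(1, "l"): 1, (2, "l"): 1, (3, "l"): 1}
print(wcrt_with_locks(C, T, blocks, N))
```

On the example of Fig. 5 this reproduces the values $R_1 = 8$, $R_2 = 13$, and $R_3 = 8$ obtained earlier from Eqs (2-6).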


Algo. 2 is an algorithm to compute the least solution to the system of
Eqs (7,8), and check schedulability of a periodic program with non-nested locks.


Algorithm 2: Check Schedulability With Locks

Data: Periodic program $P$ with locks, WCET estimates $C_i$ for $\tau_i$ and $C^l_{i,k}$ for
      lock block $B^l_{i,k}$
Result: $P$ schedulable or not; if schedulable, WCRT estimates for each task
foreach block $B^l_{i,k}$ do
    $L^{l,p}_{i,k}$ := 0;
    $L^l_{i,k}$ := $C^l_{i,k}$;
    while ($L^l_{i,k}$ does not satisfy Eq (8) and $L^l_{i,k} \leq T_i$) do
        tmp := $L^l_{i,k}$;
        $L^l_{i,k}$ := $L^l_{i,k} + \sum_{j > i} ((\lceil L^l_{i,k}/T_j \rceil - \lceil L^{l,p}_{i,k}/T_j \rceil) \cdot C_j)$;
        $L^{l,p}_{i,k}$ := tmp;
    end
    if ($L^l_{i,k}$ does not satisfy Eq (8) or $L^l_{i,k} > T_i$) then
        return “Unschedulable”;
    end
end
foreach task $\tau_i$ do
    $L_i^{p}$ := 0;
    $L_i$ := $C_i + \sum_{l \in L} (N^l_i \cdot \max_{j < i,\, k} L^l_{j,k})$;
    while ($L_i$ does not satisfy Eq (7) and $L_i \leq T_i$) do
        tmp := $L_i$;
        $L_i$ := $L_i + \sum_{j > i} ((\lceil L_i/T_j \rceil - \lceil L_i^{p}/T_j \rceil) \cdot C_j)$;
        $L_i^{p}$ := tmp;
    end
    if ($L_i$ does not satisfy Eq (7) or $L_i > T_i$) then
        return “Unschedulable”;
    end
end
return “Schedulable”, $L_1, \ldots, L_n$;


6 Rules for Disjointness


In this section we describe a set of rules which tell us when two tasks of a periodic
program are disjoint (that is, can never happen in parallel). We will then use
these rules to propose a race-detection algorithm for periodic programs.


6.1 Disjoint Block Rules


Let $P = (V, L, T)$ be a periodic program that (a) satisfies the no-nested-lock
condition of Sec. 5.2, and (b) has WCRT estimates $R_\tau$ for each task $\tau$ satisfying
$R_\tau \leq T_\tau$ (that is, $P$ is schedulable). The rules below tell us when two whole task
bodies, or two blocks within them, are disjoint. Fig. 7 illustrates Rules 1-5.


— Rule 1 (Same-Priority): Let $\tau$ and $\tau'$ be two distinct tasks in $T$ such that:
    • $\tau$ and $\tau'$ have the same priority (i.e. $p_\tau = p_{\tau'}$); and
    • Neither $\tau$ nor $\tau'$ shares a lock with a lower priority task.
  Then $\tau$ and $\tau'$ are disjoint.


— Rule 2 (Same-Period): Let $\tau$ and $\tau'$ be two distinct tasks in $T$ such that:
    • $\tau$ and $\tau'$ have the same period (i.e. $T_\tau = T_{\tau'}$); and
    • Neither $\tau$ nor $\tau'$ shares a lock with a lower priority task.
  Then $\tau$ and $\tau'$ are disjoint.


— Rule 3 (Low-Multiple-of-High): Let $\tau_l$ and $\tau_h$ be two tasks in $T$ such that:
    • $\tau_l$ has a lower priority than $\tau_h$ (i.e. $p_{\tau_l} < p_{\tau_h}$);
    • The period of $\tau_l$ is a multiple of the period of $\tau_h$ (i.e. $T_{\tau_l} = k \cdot T_{\tau_h}$, for
      some $k \in \mathbb{N}$);
    • $\tau_h$ does not share a lock with a task of lower priority than $\tau_l$; and
    • The WCRT estimate $R_{\tau_l}$ of $\tau_l$ is at most the period of $\tau_h$ (i.e. $R_{\tau_l} \leq T_{\tau_h}$).
  Then $\tau_l$ and $\tau_h$ are disjoint.


— Rule 4 (High-Multiple-of-Low): Let $\tau_l$ and $\tau_h$ be two tasks in $T$ such that:
    • $\tau_l$ has a lower priority than $\tau_h$;
    • The period of $\tau_h$ is a multiple of the period of $\tau_l$; and
    • $\tau_h$ does not share a lock with a task of lower priority than $\tau_l$.
  Then $\tau_l$ and $\tau_h$ are disjoint.


— Rule 5 (Low-WCRT): Let $\tau_l$ and $\tau_h$ be two tasks in $T$ such that:
    • $\tau_l$ has a lower priority than $\tau_h$;
    • $\tau_l$ and $\tau_h$ have periods such that neither is a multiple of the other;
    • $\tau_h$ does not share a lock with a task of lower priority than $\tau_l$; and
    • Let $m$ be the minimum strictly positive value in the set
      $\{(k \cdot T_{\tau_h}) \bmod T_{\tau_l} \mid k \in \mathbb{N}\}$
      (note that such an $m$ must exist by the second condition above). The
      WCRT estimate $R_{\tau_l}$ of $\tau_l$ is at most $m$ (i.e. $R_{\tau_l} \leq m$).
  Then $\tau_l$ and $\tau_h$ are disjoint.


— Rule 6 (Lock): Let $B$ and $B'$ be two lock($l$)-unlock($l$) blocks in distinct
  tasks $\tau$ and $\tau'$ respectively. Then $B$ and $B'$ are disjoint.
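
The period-based premises of Rules 3 and 4 are simple arithmetic checks, as the following Python sketch suggests. The function names, the boolean flag summarising the lock-sharing premise, and the WCRT value of 40 for MoveForward are all hypothetical; only the periods (100 for ObsDect, 200 for MoveForward) come from Fig. 3. The caller is assumed to pass the lower priority task as `l` and the higher priority task as `h`.

```python
def rule3_applies(T_l: int, T_h: int, R_l: float, h_shares_lock_below_l: bool) -> bool:
    """Rule 3 (Low-Multiple-of-High): T_l is a multiple of T_h and R_l <= T_h."""
    return T_l % T_h == 0 and not h_shares_lock_below_l and R_l <= T_h


def rule4_applies(T_l: int, T_h: int, h_shares_lock_below_l: bool) -> bool:
    """Rule 4 (High-Multiple-of-Low): T_h is a multiple of T_l."""
    return T_h % T_l == 0 and not h_shares_lock_below_l


# Fig. 3: MoveForward (lower priority, period 200), ObsDect (higher priority, period 100).
R_move_forward = 40  # hypothetical WCRT estimate, for illustration only
print(rule3_applies(200, 100, R_move_forward, False))
print(rule4_applies(200, 100, False))
```

Under the assumed WCRT estimate, the two tasks of Fig. 3 would be declared disjoint by Rule 3, while Rule 4 does not apply because 100 is not a multiple of 200.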


We now show that Rules 1-6 are sound. 


Theorem 3. Consider a periodic program P, with no nested locks, and WCRT 
estimates which make it schedulable. Consider two blocks which satisfy the premise 
of one of the rules; then the identified blocks are indeed disjoint in P. 


Proof. Let us fix a periodic program $P$ without nested locks, and with WCRT
estimates $R_\tau$ for each task $\tau$ in $P$, which witness the schedulability of $P$. Now
suppose $\tau$ and $\tau'$ are two tasks in $P$ satisfying the premise of Rule 1, namely
that they have the same priority and neither of them shares a lock with a lower
priority task. Now if there were no higher priority tasks and $\tau$ and $\tau'$ took no locks
at all, then clearly $\tau$ and $\tau'$ can never overlap in their execution instances, since
neither can preempt the other. However, even if there was a higher priority task,
say $\tau''$, note that by our scheduling semantics, if $\tau''$ were to interrupt $\tau$ during
its execution, $\tau$ would resume execution ahead of any other tasks of the same
priority that may be ready. So $\tau$ and $\tau'$ cannot interleave due to the preemption
by a higher priority task. The other possible cause for interleaving could be
that, say, $\tau$ gets blocked while trying to take a lock $l$ that is already held by
some other task of higher or lower priority. However, as argued earlier, a higher
priority task holding $l$ is ruled out. The case of a lower priority task holding $l$ is
ruled out by the premise of Rule 1. Thus it follows that $\tau$ and $\tau'$ cannot overlap
in any execution. The soundness of Rule 2 follows a similar argument.

For Rule 3, suppose the period of $\tau_l$ is a multiple of $T_{\tau_h}$. Let us say $\tau_l$ is made
ready at some time $t$ (which must be a multiple of its period $T_{\tau_l}$). Now either
$t$ is also a multiple of $T_{\tau_h}$, in which case $\tau_h$ will begin execution before $\tau_l$, or
$\tau_h$ is next scheduled at some time $t' > t$. In the former case, the only reason $\tau_h$
may not complete before $\tau_l$ gets to execute is that $\tau_h$ is blocked on acquiring a
lock. As in earlier arguments, this lock can only have been acquired by a task
of priority lower than $\tau_l$. But this is ruled out by the premise of the rule. In the
latter case, by the premise of the rule, $t + R_{\tau_l} \leq t'$. Hence $\tau_l$ will complete its
execution before $\tau_h$ can preempt it at $t'$.

For Rule 4, suppose $T_{\tau_h}$ is a multiple of $T_{\tau_l}$. Consider a time $t$ when $\tau_l$
is made ready. If $\tau_h$ is not also enabled at $t$, then by schedulability, $\tau_l$ must
complete before $t + T_{\tau_l}$, which is before the time $\tau_h$ is enabled next. Hence they
cannot overlap in this case. If $\tau_h$ is also enabled along with $\tau_l$ at $t$, then it must
begin execution before $\tau_l$ does. The only reason it may not complete before $\tau_l$ is
allowed to begin execution is that it is blocked on acquiring a lock $l$ held by
a task of lower priority than $\tau_l$. But this is ruled out by the premise of the rule.


Fig. 7: Illustrating Rules 1-5


For Rule 5, again consider $\tau_l$ and $\tau_h$ satisfying the premise of the rule. Let $t$
be a time point where $\tau_l$ is made ready. Either $t$ is a multiple of $T_{\tau_h}$, in which
case $\tau_h$ is also made ready at the same time; or it is not, and $\tau_h$ arrives at some
time $t'$ later than $t$. The former case is similar to the situation considered in
earlier cases, and the instances of $\tau_l$ and $\tau_h$ cannot overlap. In the latter case, by
the premise of the rule, we must have $t + R_{\tau_l} \leq t + m \leq t'$, and hence $\tau_l$ would
finish its execution by $t'$, and the two tasks cannot overlap. The soundness of
Rule 6 is standard.


6.2 Computing the value m in Rule 5 


Rule 5 requires us to compute the value $m$, which is the smallest positive
remainder that we can get by dividing an integral multiple of $T_{\tau_h}$ by $T_{\tau_l}$. It is
not difficult to see that all possible remainders must occur in the interval $[0, T]$
where $T$ is the LCM of $T_{\tau_l}$ and $T_{\tau_h}$. Thus it is sufficient to look at the multiples
of $T_{\tau_h}$ up to $T$, and set $m$ to be the minimum positive remainder we get by
dividing these by $T_{\tau_l}$.
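
A brute-force computation of $m$ along these lines can be sketched in Python as follows. The function name `min_positive_remainder` is hypothetical and integer periods are assumed; the loop enumerates the first $T_{\tau_l}$ multiples of $T_{\tau_h}$, after which the remainders repeat.

```python
from typing import Optional


def min_positive_remainder(T_h: int, T_l: int) -> Optional[int]:
    """Smallest strictly positive value of (k * T_h) mod T_l, or None if there is none."""
    # The remainders repeat with a period that divides T_l, so T_l multiples suffice.
    remainders = {(k * T_h) % T_l for k in range(1, T_l + 1)}
    positive = [r for r in remainders if r > 0]
    return min(positive) if positive else None


# Periods of the highest and lowest priority tasks of the example in Sec. 5.
print(min_positive_remainder(8, 20))
```

For integer periods the result coincides with the greatest common divisor of the two periods whenever $T_{\tau_h}$ is not a multiple of $T_{\tau_l}$, so `math.gcd` could replace the enumeration; the explicit loop is kept here because it mirrors the description above.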


6.3 Race Detection Algorithm 


We now present the algorithm to detect races in periodic programs. Algo. 3 first 
identifies the set of shared variables accessed in the program and then lists all 
the conflicting access pairs, which are all assumed to be potentially racy initially. 
The algorithm, using the rules in Sect. 6 and the lockset analysis, described next, 
then prunes out the pairs of accesses found to be non-racy. 

An iterative lockset analysis computes the set of locks held at each statement
in a program $P$. At the program entry, it is assumed that no locks are held. For
the lock($l$) command, locks held are the set of locks held before this command
along with the lock $l$. For the unlock($l$) command, locks held are the set of locks
held before this command with the lock $l$ removed. For any other command, the
lockset remains the same as held in the previous command. The join operation,
in this analysis, is the intersection of locksets.
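
A worklist formulation of such a lockset analysis over a CFG might look as follows in Python. This is a minimal sketch, assuming an edge-list encoding of the CFG in which each command is a tuple `("lock", l)`, `("unlock", l)`, or `("other",)`; the function name and the small example graph are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[int, Tuple[str, ...], int]  # (source location, command, target location)


def lockset_analysis(entry: int, edges: List[Edge]) -> Dict[int, FrozenSet[str]]:
    """Locks definitely held on reaching each location; join is set intersection."""
    held: Dict[int, FrozenSet[str]] = {entry: frozenset()}  # no locks held at entry
    work = [entry]
    while work:
        loc = work.pop()
        for src, cmd, dst in edges:
            if src != loc:
                continue
            out = held[loc]
            if cmd[0] == "lock":
                out = out | {cmd[1]}
            elif cmd[0] == "unlock":
                out = out - {cmd[1]}
            # First visit: take the incoming set; later visits: intersect.
            new = out if dst not in held else held[dst] & out
            if held.get(dst) != new:
                held[dst] = new
                work.append(dst)
    return held


# 0 --lock(l)--> 1 --x:=1--> 2 --unlock(l)--> 3, plus a branch 0 --skip--> 3.
edges = [
    (0, ("lock", "l"), 1),
    (1, ("other",), 2),
    (2, ("unlock", "l"), 3),
    (0, ("other",), 3),
]
result = lockset_analysis(0, edges)
print(result)
```

The iteration terminates because the set recorded at a location can only shrink after it is first assigned. In the example, lock `l` is held at locations 1 and 2 and at no other location.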

The algorithm uses the notion of covers, which needs further explanation. Let
$\tau_1$ and $\tau_2$ be two tasks in a periodic program $P$ and $s_1$ and $s_2$ be two statements
in $P$. We say the pair of tasks $(\tau_1, \tau_2)$ covers the pair of statements $(s_1, s_2)$ if
either $s_1$ is a statement in $G_{\tau_1}$ and $s_2$ is a statement in $G_{\tau_2}$, or vice versa (i.e. $s_1$
in $G_{\tau_2}$ and $s_2$ in $G_{\tau_1}$).


7 Experimental Evaluation 


In this section we first describe the implementation of Algo. 3 to detect races 
in periodic programs. We then explain the benchmarks used to evaluate the 
implementation followed by a discussion of the results.


Algorithm 3: Race Detection

  Data: Periodic program $P$
  Result: List of potential races $PR$
  Identify the set of shared variables $V$;
  Find the list $CA$ of conflicting accesses on $V$;
  $PR := CA$;
  Find list $DT$ of disjoint tasks using rules in Sec. 6;
  foreach pair $(s_1, s_2)$ of conflicting accesses in $PR$ do
    if there is a pair $(\tau_1, \tau_2)$ of tasks in $DT$ such that
       $(\tau_1, \tau_2)$ covers $(s_1, s_2)$ then
      // $(s_1, s_2)$ are non-racy
      $PR := PR - \{(s_1, s_2)\}$;
    end
  end
  Perform lockset analysis on each task in $P$;
  foreach pair $(s_1, s_2)$ of conflicting accesses in $PR$ do
    let $L_1$ be the lockset at $s_1$ and $L_2$ be that at $s_2$;
    if $L_1 \cap L_2 \neq \emptyset$ then
      // $(s_1, s_2)$ are non-racy
      $PR := PR - \{(s_1, s_2)\}$;
    end
  end
  return $PR$; // Set of potential races
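
The two pruning loops of Algo. 3 can be sketched in Python as follows. This is an illustrative sketch, not the PEPRACER implementation: it assumes that the conflicting access pairs, the disjoint task pairs, and the locksets have already been computed, and that each statement belongs to exactly one task (as is the case after inlining). All names (`prune_races`, `task_of`, and so on) are hypothetical.

```python
def prune_races(conflicting, task_of, disjoint_tasks, lockset):
    """Return the conflicting access pairs that remain potentially racy.

    conflicting:    list of statement pairs (s1, s2).
    task_of:        dict mapping each statement to the task containing it.
    disjoint_tasks: iterable of task pairs found disjoint by the rules.
    lockset:        dict mapping each statement to the set of locks held there.
    """
    # An unordered pair of tasks covers (s1, s2) in either order.
    disjoint = {frozenset(pair) for pair in disjoint_tasks}

    # First loop: drop pairs covered by a disjoint pair of tasks.
    pr = [(s1, s2) for (s1, s2) in conflicting
          if frozenset((task_of[s1], task_of[s2])) not in disjoint]

    # Second loop: drop pairs protected by a common lock.
    pr = [(s1, s2) for (s1, s2) in pr if not (lockset[s1] & lockset[s2])]
    return pr


task_of = {"a": "t1", "b": "t2", "c": "t3"}
lockset = {"a": {"l"}, "b": set(), "c": {"l"}}
conflicting = [("a", "b"), ("a", "c"), ("b", "c")]
print(prune_races(conflicting, task_of, [("t2", "t1")], lockset))
```

In this toy input, `("a", "b")` is removed because tasks `t1` and `t2` are disjoint, `("a", "c")` is removed because both accesses hold lock `l`, and `("b", "c")` remains a potential race.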


7.1 Implementation 


We implemented Algo. 3 in the tool PEPRACER [19] as shown in Fig. 8. The 
tool has a preprocessor, which inlines the functions in the input program, a time 
analyzer which computes WCET of tasks using Heptane [11], and then their 
WCRT using Algo. 2. The CA generator identifies the shared accesses, which 
are essentially accesses to global variables or shared locations through pointers, 
in the program, and then lists the conflicting access pairs. The Rules Checker 
identifies disjoint task pairs using the response times and eliminates conflicting 
accesses that are non-racy. The rules, described in Sec. 6, are applied on the 
conflicting accesses to eliminate non-racy pairs. The Lockset Analyzer computes 
the locks held at each statement in the program and further eliminates the 
remaining conflicting accesses that are non-racy. The tool finally displays the 
potentially racy pairs. 

We implemented PEPRACER in the OCaml-based C Intermediate Language
(CIL) static analysis framework [15]. The Inliner step in PEPRACER uses the 
built-in inline pass in CIL while the lockset algorithm and Rules Checker are 
implemented as new passes in CIL. The implementation of the WCET Analyzer 
is explained next. 


Fig. 8: Schematic of PEPRACER


WCET Analysis WCET analysis was carried out on the benchmarks using the
Heptane [11] tool. Heptane accepts inputs in the form of C programs. To prepare
the benchmark programs, the following modifications were made to them: All
non-C constructs in the benchmarks were translated to suitable C constructs, e.g.
TASKs in OSEK programs were converted to correspondingly named functions.
All code was merged into a single C file. Some benchmark programs did not
have the source for some of their parts. Heptane needs the source code for the
entire program being analysed. Hence, all code for which source code was not
available was replaced with minimal stubs. Loop bounds were provided using
ANNOT_MAXITER as required by Heptane. These loop bounds were computed by
manual inspection.

For each benchmark, the WCET was computed separately for each of its
task entry functions. Heptane supports WCET analysis for ARM and MIPS
architectures. Where possible, the WCET analysis was run using default settings
for both architectures. The difference between the WCET results for the two
architectures was found to average around 4%, never exceeding 20%. In our
analysis we use the values for the ARM architecture.

Some aspects which may lead to our WCET estimates not being conservative 
are as follows: 


1. Stub functions were used for those parts of the code whose source was not 
available. This accounts for < 1% of the total code analysed. 

2. Loop bounds were defined using manual inspection. 

3. A small number of lines of code had to be masked to prevent Heptane from 
crashing. 


For more accurate WCET analysis, data corresponding to the specific target
architecture being considered should be used. Several WCET analysis tools are
available [21] in both the commercial and academic domains. The choice of
analysis tool influences the accuracy of the WCET analysis.


7.2 Benchmarks 


We tested the implementation on a few benchmark periodic programs shown
in Table 2. Most real-world periodic programs are proprietary and difficult
to gain access to. Hence we resorted to some programs from the nxtOSEK
benchmark set, lego-osek-master project, ev3OSEK benchmark set,
nxt-osek-sumo-master project, AADLib benchmark set [1] and examples in [10] and [14]
for evaluation of the tool. The programs in AADLib are configured to run on 
FreeRTOS while the others are designed to run on the OSEK real time oper- 
ating system. The program fse_obstacle.c implements a simplified version of 
a robotic controller which detects obstacles in its proximity while avionics.c 
specifies the general functions, data interactions, and timing constraints for a hy- 
pothetical avionics Mission Control Computer (MCC) system. Biped_robot.c 
is a sample program for LATTEBOX NXTe/LSC based biped robot. Sumo.c 
implements a robot which attempts to push its opponent out of a circle. A Blue- 
tooth based radio controlled car is implemented in nxtgt.c. In lego_osek.c 
a robot detects obstacles and avoids collision by changing angle and speed. 
Objectfollower.c implements a follower. It goes forward as an object goes 
forward; when the object stops moving, it stops as well, and follower.c is 
similar. A two wheeled self-balancing radio controlled robot is implemented in 
nxtway_gs.c. Ardupilot.c, taken from [1], is a simple version of the popular 
autopilot system supporting many vehicle types. sumoR.c and carR.c are racy 
versions of the programs sumo.c and car.c respectively. 

We have annotated the programs with task attributes such as period, priority,
and WCET, along with details of the locks held. The non-periodic tasks
in some of the programs are treated as tasks with a large period. We have
inlined the helper functions called in the tasks, along with the calls to library
functions. This exposes the accesses to shared structures in the library.
For example, the ecrobot library function ecrobot_set_motor_speed, which is
called in lego_osek.c, accesses the shared NXT_PORT_A port. The GetResource
and ReleaseResource functions, used to take and release locks respectively, are
taken to be the lock and unlock commands in our analysis. Note that in OSEK,
resources are locked according to the Priority Ceiling Protocol (PCP). For our
evaluation, however, we assume these programs use standard locks. We believe
the placement of locks would not change even if the developer were using
standard locks. FreeRTOS supports the use of standard locks.


7.3 Results 


We ran our tool on the benchmark programs on an Intel Quad Core i7-3770
3.40GHz machine running Ubuntu 18.04.4. Table 2 shows the results of running
our tool. The “Tasks” column gives the number of tasks in the program, “Sched.”
gives whether the program is schedulable or not (by Algo. 2), the number of
conflicting accesses in a program is listed under the “CA” column, and the count
of potentially racy pairs is given under the “PR” column. The “%Elim.” column
gives the percentage of conflicting accesses that are found to be non-racy. The
last column gives the time taken by the tool, which was measured using the
Linux time command.


Table 2: Results


Program          | LoC  | Tasks | Sched. | CA   | PR  | %Elim. | Time (sec)
fse_obstacle.c   | 24   | 2     | Y      | 3    | 0   | 100    | 0.12
avionics.c       | 588  | 15    | N      | 51   | 42  | 18     | 0.13
biped_robot.c    | 340  | 3     | Y      | 1    | 0   | 100    | 0.22
sumo.c           | 5287 | 4     | Y      | 146  | 0   | 100    | 0.32
nxtgt.c          | 209  | 4     | Y      | 3    | 0   | 100    | 0.21
lego_osek.c      | 2036 | 2     | Y      | 1320 | 0   | 100    | 0.12
objectfollower.c | 1878 | 3     | Y      | 14   | 0   | 100    | 0.31
nxtway_gs.c      | 2263 | 3     | Y      | 4    | 0   | 100    | 0.37
car.c            | 1329 | 4     | Y      | 670  | 0   | 100    | 0.28
ardupilot.c      | 1392 | 4     | Y      | 17   | 0   | 100    | 0.24
follower.c       | 2769 | 7     | Y      | 1179 | 0   | 100    | 0.30
sumoR.c          | 5287 | 4     | Y      | 146  | 77  | 47     | 0.31
carR.c           | 1329 | 4     | Y      | 670  | 125 | 81     | 0.28
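
The “%Elim.” entries can be recomputed from the “CA” and “PR” columns. The short Python check below assumes that the percentage is $(CA - PR)/CA$ rounded to the nearest integer, which is our reading of the column description rather than a formula stated in the paper; the function name is ours.

```python
def percent_eliminated(ca, pr):
    """Percentage of conflicting access pairs pruned as non-racy."""
    return round(100 * (ca - pr) / ca)


# (CA, PR) values taken from the three rows of Table 2 with PR > 0.
rows = {"avionics.c": (51, 42), "sumoR.c": (146, 77), "carR.c": (670, 125)}
for name, (ca, pr) in rows.items():
    print(name, percent_eliminated(ca, pr))
```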


Our tool detects the avionics.c program to be non-schedulable, which is
also detected by [14]. Rules 3, 4, and 5 depend on the response times of the
tasks, so we bypassed the application of these rules for avionics.c. The “PR”
column in the table for avionics.c gives the count of potentially racy pairs
detected after the application of the other rules. The last two rows of the table
show the data for benchmarks that have been modified to make them racy
by changing the periods, execution times, etc. Our tool is able to filter out a
large part of the conflicting access (CA) pairs as non-racy (on average 97%
of CA pairs are eliminated).

Table 3 gives the coverage of the rules (Rules 1-6). Here each rule is indepen- 
dently applied on the conflicting accesses to demonstrate the value of each rule 
separately. Column “R1” gives the count of CA pairs flagged as non-racy due to 
Rule 1 only. The case is similar with other columns. Recall that the non-trivial 
rules like Rules 3-5 use periodicity and/or response time to declare CA pairs as 
non-racy. A careful analysis of the count for these in Table 3 reveals their use- 
fulness in flagging non-racy pairs. Some pairs are detected by these rules while 
not covered by the other simpler rules. It is also worth observing that the
CA pairs detected as non-racy by Rule 6 (the one based on locks) are covered 
by other rules. The developers can use this information to decide on whether to 
use expensive constructs like lock-unlock to ensure mutual exclusion when the 
task periodicity and response time can themselves ensure it. 




Table 3: Rule Coverage


Program          | CAs  | R1 | R2  | R3  | R4  | R5  | R6
fse_obstacle.c   | 3    | 0  | 0   | 3   | 0   | 0   | 0
avionics.c       | 51   | 0  | 9   | -   | -   | -   | 0
biped_robot.c    | 1    | 0  | 0   | 0   | 0   | 1   | 1
sumo.c           | 146  | 35 | 69  | 69  | 69  | 112 | 6
nxtgt.c          | 3    | 0  | 0   | 3   | 0   | 0   | 0
lego_osek.c      | 1320 | 0  |     |     |     |     |
objectfollower.c | 14   | 0  |     |     |     |     |
nxtway_gs.c      | 4    | 0  | 0   | 4   | 0   | 0   | 0
car.c            | 670  |    | 90  | 133 | 164 | 463 | 117
ardupilot.c      | 17   |    | 17  | 17  | 17  | 0   | 0
follower.c       | 1179 |    | 144 | 144 | 204 | 975 | 4
sumoR.c          | 146  | 35 | 69  | 69  | 69  | 35  | 6
carR.c           | 670  | 0  | 0   | 0   | 74  | 463 | 117


8 Related Work 


We begin with work related to computing response times and schedulability 
analysis. Apart from the work of [13,12] already mentioned, feasibility analysis
for real-time periodic tasks without locks has been studied by Baruah et al. [4]
and Pellizzoni and Lipari [16]. Baruah [3] studies schedulability under Earliest
Deadline First and Stack Resource Policy (EDF+SRP) and gives an efficient
algorithm for checking schedulability. Bertogna et al. [5] study resource holding
times (how long a task may hold on to a lock/resource) and give algorithms for 
computing and minimizing these times. 

In closely related classical work on real-time systems that use locks, Sha et
al. [18] consider a very general setting of priority-based preemptive scheduling,
with FCFS among waiting tasks of the same priority (similar to our setting),
with arbitrarily nested locks, and give sufficient conditions for schedulability of
programs under these conditions. However, the locks they consider are priority
inheritance based locks, which elevate the priority of a task that is in a critical
section to a level based on the priorities of the tasks waiting for (or that might
acquire) this resource. Programs with such locks have the useful property that
the blocking time of a task is bounded by the longest WCET of a lock block
(critical section) of a lower priority task. This facilitates their analysis and
bounds on response time. In our setting of standard locks (though restricted to
be non-nested) it is not clear if such properties can be exploited.

Related work on verification of periodic programs can be broadly classified 
into two categories: Verification of periodic programs using techniques like model 
checking, symbolic execution etc., and detecting data races in programs for em- 
bedded applications similar to periodic programs, using static analysis tech- 
niques. 

Periodic programs with tasks prioritized in a rate monotonic fashion and
communicating using shared variables have been verified against safety properties
using bounded model checking with different kinds of locks in [7], [6] and [8].
In the first paper of the series [7], the authors provide a time-bounded
verification of safety properties where the sequentializations of programs are
considered with respect to the number of jobs of each task within the time bound.
Priority and preemption locks are considered in [7] and the work is extended to
include Priority Inheritance Protocol (PIP) locks in [8]. [6] proposes a new
sequential composition mechanism to reduce the number of sequentializations and
make the bounded verification scalable. However, the verification is bounded to
a certain depth, and in general cannot be used to soundly detect all data races.

PLC programs are very similar to our periodic programs and are widely used 
in embedded safety critical software. Symbolic execution of PLC programs is 
developed in [10] where the authors convert PLC programs into C programs 
and use their rate-monotonic, priority-based, preemptive scheduling semantics 
to reduce the number of interleavings considered. The only way to use their
symbolic execution to detect data races would be for the developer to introduce
a counter for each shared variable and increment and decrement this counter,
and then check for violations of assertions that encode racy accesses to these
variables. This technique is unlikely to be scalable.

Static analysis based techniques for detecting data races in embedded software
kernels and applications have been of recent research interest [17], [9],
[20]. Schwarz et al. [17] provide an algorithm to detect data races in multi-task
programs with priority ceiling locks. Additional synchronization mechanisms,
including dynamic threads and suspend-resume of scheduler and tasks, are
considered in [20]. Both these works exploit priorities and locks, but do not consider
periodicity and WCRT information like we do, and would lead to less precise 
results on the class of periodic programs considered in this paper. 


9 Conclusion 


In this work we have proposed a technique for statically detecting data races 
in periodic real-time programs with locks. Our contribution includes a response 
time analysis for such programs when the locks are used in a non-nested man- 
ner. Going forward, some interesting directions include using the insights in this 
paper to perform precise and efficient data-flow analysis for such programs; im- 
proving the tightness of the response time analysis; and extending the technique 
for detecting high-level races for the class of such programs and for periodic pro- 
grams with other locking mechanisms like priority-inheritance based locks, and 
other scheduling policies.


Abstract. We present Probabilistic Total Store Ordering (PTSO), a
probabilistic extension of the classical TSO semantics. For a given (finite- 
state) program, the operational semantics of PTSO induces an infinite- 
state Markov chain. We resolve the inherent non-determinism due to 
process schedulings and memory updates according to given probabil- 
ity distributions. We provide a comprehensive set of results showing the 
decidability of several properties for PTSO, namely (i) Almost-Sure (Re- 
peated) Reachability: whether a run, starting from a given initial configu- 
ration, almost surely visits (resp. almost surely repeatedly visits) a given 
set of target configurations. (ii) Almost-Never (Repeated) Reachability: 
whether a run from the initial configuration, almost never visits (resp. 
almost never repeatedly visits) the target. (iii) Approximate Quantita- 
tive (Repeated) Reachability: to approximate, up to an arbitrary degree 
of precision, the measure of runs that start from the initial configuration 
and (repeatedly) visit the target. (iv) Expected Average Cost: to approx- 
imate, up to an arbitrary degree of precision, the expected average cost 
of a run from the initial configuration to the target. We derive our results 
through a nontrivial combination of results from the classical theory of 
(infinite-state) Markov chains, the theories of decisive and eager Markov 
chains, specific techniques from combinatorics, as well as decidability
and complexity results for the classical (non-probabilistic) TSO seman- 
tics. As far as we know, this is the first work that considers probabilistic 
verification of programs running on weak memory models. 


1 Introduction 


The classical Sequential Consistency (SC) semantics [1] has been a fundamental 
assumption in concurrent programming. SC guarantees that process operations 
are atomic. A write operation, performed by a given process, is immediately 
visible to all the other processes. However, designers of modern computer
systems, in their quest for increased system efficiency, often sacrifice the SC
guarantee. Instead, the processes communicate asynchronously, allowing a delay in the
propagation of write operations. Due to the propagation delay, written values 
can become available to processes at different time points, and in an order that 
may be different from the order in which they are generated. This asynchronous 
behavior gives rise to new semantics, collectively referred to as weak memory
models [2]. In the presence of weak memory models, programs exhibit new, and
often unexpected, behaviors, bringing about complex challenges in the design
and analysis of concurrent systems. Even textbook programs may behave
erroneously. The classical Dekker mutual exclusion protocol is a case in point. The
ubiquity of weak memory models has led to an extensive research effort for the 
testing and verification of concurrent programs running under such semantics. 


Existing works on the verification of programs running on weak memory
models consider safety properties such as state reachability, assertion violation,
and robustness. While safety properties are fundamental, we also need to prove
liveness properties, i.e., to show that the program indeed makes progress. This 
is, of course, true already in the case of SC. A program, such as a mutual 
exclusion protocol, needs to guarantee that each process will eventually reach 
its critical section. The satisfiability of liveness properties is often dependent 
on the type of fairness conditions on process executions that are provided by 
the underlying platform [3,4]. The reason is the presence of concurrency non- 
determinism, i.e., the inherent non-determinism in program behavior due to the 
different possible ways in which the scheduler can interleave the processes. The 
scheduler may always neglect a given process, which means that the process 
will never make progress (e.g., never reaches its critical section). Therefore, we 
need the scheduler to follow a fair selection policy that allows each process to 
advance in its execution. The situation is even more complicated in the case of 
weak memory models, since we also need to deal with a second source of non- 
determinism, besides concurrency non-determinism, namely (data) propagation 
non-determinism. Since write operations are propagated asynchronously, there 
is in general no way to predict if, when, and in which order, write operations 
become visible to the processes. 


In this paper we present a framework for the verification of liveness properties 
for concurrent programs running under the classical Total Store Ordering (TSO) 
semantics [5]. The TSO model puts an unbounded store (write) buffer between 
each process and the main memory. The buffer carries pending write operations 
that have been performed by the process. These operations are propagated from 
the buffer to the shared memory in a FIFO manner. When a process performs 
a write operation, it appends the operation as a message to its buffer. When 
a process reads a variable, it searches its buffer for a pending write operation 
on that variable. If such operations exist then it reads from the most recent 
one. If no such operation exists, it fetches the value of the variable from the 
main memory. The TSO propagation mechanism is a typical example of how 
propagation non-determinism arises: the write operations are propagated to the 
shared memory non-deterministically, and a process sees the other processes’ 
write operations only when the latter are available in the memory. Therefore, 
having a scheduler that fairly selects the processes is not sufficient. We also need 
to ensure that the write operations propagate to the processes sufficiently often. 
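
The buffer mechanism described above can be illustrated with a small Python model. This is a sketch for intuition only: the class and method names are ours, variables are assumed to start at 0, and the update step propagates a single pending write of one process, whereas the paper's update steps may transfer several messages at once.

```python
from collections import deque


class TsoMemory:
    """Toy model of TSO: one FIFO store buffer per process plus shared memory."""

    def __init__(self, processes, variables, initial=0):
        self.memory = {x: initial for x in variables}
        self.buffers = {p: deque() for p in processes}

    def write(self, process, variable, value):
        # A write is appended to the writer's own buffer, not to memory.
        self.buffers[process].append((variable, value))

    def read(self, process, variable):
        # Read the most recent pending write on the variable, if any ...
        for var, value in reversed(self.buffers[process]):
            if var == variable:
                return value
        # ... otherwise fetch the value from the shared memory.
        return self.memory[variable]

    def update(self, process):
        # Propagate the oldest pending write of the process to memory (FIFO).
        if self.buffers[process]:
            variable, value = self.buffers[process].popleft()
            self.memory[variable] = value


m = TsoMemory(["p", "q"], ["x", "y"])
m.write("p", "x", 1)
m.write("q", "y", 1)
print(m.read("p", "y"), m.read("q", "x"))  # neither process sees the other's write yet
m.update("p")
print(m.read("q", "x"))                    # p's write has now reached the memory
```

The first line of output shows the typical non-SC behavior: both processes read the old values because the writes are still pending in the buffers.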


Traditional fairness conditions such as strong or weak fairness [3,4,6] can- 
not capture propagation policies adequately since they irrationally allow slow 
propagation, i.e., they allow write operations to propagate at a lower rate than
the rate at which they are issued. For instance, strong fairness guarantees that
messages are transferred infinitely often from the buffers to the memory. Still, 
it does not constrain the relative frequency of write and update operations, and 
hence it does not prevent the buffer contents from growing unboundedly. In such 
a scenario, more and more un-propagated messages may be clustered inside the 
buffers, and a given process may, from some point on, be confined only to read 
its own writes, since it will not see the memory updates by the other processes. 
Accordingly, verifying liveness properties subject to strong fairness may wrongly 
deem the system to be incorrect: even if a process is selected infinitely often by 
the scheduler and write operations are propagated infinitely often to the mem- 
ory, a given process may incorrectly be judged not to make progress due to slow 
propagation. 

While slow propagation can arise theoretically under the above-mentioned
fairness conditions, it is almost never observed in practice. Existing platforms
implement different policies, such as invalidation or write-back policies, to flush
the buffers at regular intervals [7,8]. This prevents the buffers from growing
beyond certain sizes, and implicitly ensures propagation fairness. In fact, this is
true to the degree that non-SC behaviors are (relatively) rarely observed on TSO
platforms [9,10].

In this paper, we perform verification of liveness properties for concurrent 
programs under TSO using probabilistic fairness [11]. As far as we know, this 
is the first work that considers probabilistic verification of programs running 
on weak memory models. In our model, both process scheduling and message 
propagation are carried out according to given probability distributions. We as- 
sign a weight (a natural number) to each process. We resolve concurrency non- 
determinism probabilistically by letting the scheduler select the next process to 
execute with a probability that reflects the weight of the process compared to 
the weights of the other processes that are enabled in the same configuration. 
After each process step, we allow an update step, in which the buffers transfer 
parts of their contents to the memory. We make the probability distribution 
equal among all possible update operations in the given configuration⁴. As we
will see later in the paper, defining the model in this way implies that we assign 
low probabilities to program runs that unboundedly increase the number of mes- 
sages inside the buffers. Accordingly, our model is more faithful to real program 
behavior compared to models induced by non-probabilistic fairness conditions. 
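
The two probability distributions described above can be written down directly. The Python sketch below is illustrative: the function names are ours, weights are assumed to be positive natural numbers, and the update operations are represented by arbitrary labels.

```python
from fractions import Fraction


def scheduling_distribution(weights, enabled):
    """Probability of selecting each enabled process, proportional to its weight."""
    total = sum(weights[p] for p in enabled)
    return {p: Fraction(weights[p], total) for p in enabled}


def update_distribution(updates):
    """Uniform distribution over the update operations possible in a configuration."""
    return {u: Fraction(1, len(updates)) for u in updates}


weights = {"p": 1, "q": 3, "r": 2}
print(scheduling_distribution(weights, ["p", "q"]))   # r is not enabled here
print(update_distribution(["u0", "u1", "u2", "u3"]))
```

With weights 1 and 3, process `p` is selected with probability 1/4 and `q` with probability 3/4; the weight of the disabled process `r` plays no role in this configuration.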

We perform a comprehensive analysis of the decidability of verifying liveness 
properties for concurrent programs running under the TSO semantics, subject to 
probabilistic fairness. In fact, verifying programs running on the TSO memory 
model, even with respect to safety properties, poses a difficult challenge. The 
unboundedness of the buffers implies that the state space of the system is infi- 
nite, even in the case where the input program is finite-state [12,13]. Similarly, 
the operational semantics of our model gives rise to Markov chains with infi- 
nite state spaces. Furthermore, in general, liveness properties give rise to more 
difficult problems than safety properties, since the former are interpreted over
infinite program executions while the latter are interpreted over finite
executions. Our results rely on nontrivial combinations of results from the classical
theory of (infinite-state) Markov chains [14,15], the theories of decisive and
eager Markov chains [16,17], specific techniques from combinatorics [18], as well
as decidability and complexity results for the classical (non-probabilistic) TSO
semantics [19,13]. Concretely, we show the decidability of the following problems,
each of which is defined by giving an initial configuration $\gamma_{\mathit{init}}$
and a set Target of process target states.


4 Our framework allows several other types of probability distributions (see Sec. 9).


Qualitative Analysis (Sec. 6). In qualitative reasoning, we are interested in 
knowing whether the given property is satisfied with probability 1 (almost surely 
satisfied), or with probability 0 (almost never satisfied). We show that the satis- 
fiability of these properties can be reduced to similar problems on the underlying 
(non-probabilistic) transition systems for classical TSO. The actual probabili- 
ties appearing in the induced Markov chains then are inconsequential and only 
their non-zeroness matters. This is useful whenever the probabilities have not 
been measured exactly, or the portion of the system giving rise to probabilistic 
behavior has not been designed yet. We consider the following different flavors 
of qualitative analysis: Almost-Sure (Repeated) Reachability⁵: whether a run of
the system from $\gamma_{\mathit{init}}$ will almost surely visit (resp. repeatedly visit) Target;
Almost-Never (Repeated) Reachability: whether a run of the system from $\gamma_{\mathit{init}}$
will almost never visit (resp. repeatedly visit) Target. Furthermore, we show
that all these problems have non-primitive-recursive complexities. 


Quantitative Analysis (Sec. 7). The task is to estimate to an arbitrary degree
of precision the probability by which a run from $\gamma_{\mathit{init}}$ (repeatedly) visits Target,
rather than only checking whether the probability is equal to one or zero.


Expected Average Cost (Sec. 8). We study the expected cost for runs that
start from $\gamma_{\mathit{init}}$ until they reach Target. To that end, we extend our model by
providing a cost function that assigns a fixed cost to each instruction in the lan- 
guage. Calculating expected costs of runs has many potential applications. For 
instance, one might be interested in the mean-time of reaching a target, i.e., the 
average number of steps before reaching the target [20]. In the context of weak 
memory models, in general, and TSO in particular, one can perform a more re- 
fined analysis by also taking into account the fact that specific instructions, e.g., 
memory fences, have higher costs [21]. Incorporating instruction costs in the 
model makes average cost analysis reflect more faithfully the efficiency of the 
program compared to an instruction count based metric. There have been sev- 
eral approaches towards optimizing fence implementations in hardware [22,23,24] 
which exploit the fact that non-SC behaviours are rare even in unfenced code. 
A quantitative analysis of the prevalence of behaviours and cost of executing 
instructions can help determine the efficacy of such implementations. 


5 While repeated reachability is a liveness property, plain reachability in the non- 
probabilistic case is a safety property. However, in the presence of probabilities, 
plain reachability measures the probability of convergence towards a target state, 
and hence it can be considered a form of liveness property. In any case, this is a 
matter of definition and has no bearing on the rest of the paper.

The supplementary material [25] contains detailed proofs of all the lemmas
and theorems. 

Related Work Only recently has there been an increased interest in the
formulation and verification of liveness properties for weak memory models. In
[26], the authors factor the system into process and memory subsystems and
define notions of fairness for each. This is reminiscent of our approach, where we
consider probabilistic policies for process scheduling and memory update. Their
model, on the other hand, is non-probabilistic and has weaker fairness
guarantees, which we describe in more detail in Sec. 5.1. The liveness verification
problem for TSO has been considered in [27], where the authors show
undecidability for various liveness properties. However, they once again work with
non-probabilistic notions of fairness. We show in this paper that with stronger
(probabilistic) fairness, reachability and repeated reachability problems become
decidable.

In [12], the authors show the undecidability of the repeated reachability problem,
without fairness conditions, for finite-state programs running under the TSO se- 
mantics. In contrast, we show that checking repeated reachability qualitatively is 
decidable (Sec. 6.2), and that we can even compute the measure of runs satisfying 
the property with arbitrary precision (Sec. 7.2). 

There has been a huge amount of work on the verification of finite-state
Markov chains (see, e.g., [20,28]). Since the buffers in TSO are unbounded, we,
however, get an infinite-state Markov chain. There is also a substantial
literature on the verification of infinite-state Markov chains, where specialized
techniques are developed for particular classes of systems. Several works have
considered probabilistic push-down automata and probabilistic recursive machines
[29,30,31]. However, these techniques do not apply in our case since push-down
automata cannot encode the FIFO store-buffer data structure.

Works such as [32,16,33,34] develop algorithmic and complexity results for 
checking termination and reachability for systems such as probabilistic VASS, 
probabilistic Petri nets, and probabilistic multi-counter systems. Again, these models
are different from ours and cannot encode FIFO queues. 

The works closest to ours are those on probabilistic lossy channel systems 
[16,17]. These works also rely on the frameworks of decisive and eager Markov 
chains. However, lossy channel systems and TSO are fundamentally different, 
and the manner in which we instantiate the frameworks of decisive/eager Markov 
chains differs. The decidability of verification for probabilistic extensions of lossy 
channels is sensitive to the definition of the message losses. In the case of lossy 
channel systems, if messages are only allowed to be lost at one end of the chan- 
nel (a model that is close to our notion of message updates), then all non-trivial 
verification problems become undecidable for probabilistic lossy channel systems 
[35]. Therefore, although there is a reduction from TSO to lossy channel systems 
in the case of non-probabilistic models [12], we know of no such reduction be- 
tween the corresponding probabilistic models. 

Finally, the concept of decisiveness has been extended to more general models 
such as generalized semi-Markov processes, stochastic timed automata [36], and 
lossy channel-based stochastic games [37].


2 Preliminaries


In this section, we introduce notation and recall the basics of transition
systems, temporal logic, and Markov chains.


Basic Notation The size of a set $A$ is denoted by $|A|$. We use $A^*$ and $A^\omega$ to
denote the set of finite resp. infinite words over (a possibly infinite set) $A$, and
let $\epsilon$ be the empty word. For $w \in A^*$, $|w|$ denotes the length of $w$ ($|w| = \infty$
if $w$ is infinite). For $i$ with $1 \le i \le |w|$, we use $w[i]$ to denote the $i$-th element of
$w$. We define $\mathrm{head}(w) := w[1]$ and $\mathrm{tail}(w) := w[2] \cdots w[|w|]$. We use $a \in w$
to denote that $w[i] = a$ for some $i$ with $1 \le i \le |w|$. For words $w_1 \in A^*$ and
$w_2 \in (A^* \cup A^\omega)$, we use $w_1 \cdot w_2$ to denote their concatenation. For $k \in \mathbb{N}$, we
define $A^k := \{w \in A^* \mid |w| = k\}$, i.e., it is the set of words over $A$ of length $k$.


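Transition Systems A transition system is a pair $(\Gamma, \to)$ where $\Gamma$ is a (potentially)
infinite set of configurations, and $\to \subseteq \Gamma \times \Gamma$ is the transition relation. We write
$\gamma \to \gamma'$ to denote that $(\gamma, \gamma') \in \to$, and use $\to^*$ to denote the reflexive transitive
closure of $\to$. For $k \in \mathbb{N}$, we write $\gamma \to^{k} \gamma'$ to denote that there is a sequence
$\gamma_0 \to \gamma_1 \to \cdots \to \gamma_k$ where $\gamma_0 = \gamma$ and $\gamma_k = \gamma'$, i.e., there is a sequence of
$k$ transition steps leading from $\gamma$ to $\gamma'$. For $\sim \in \{<, \le, =\}$, we write $\gamma \to^{\sim k} \gamma'$ to
denote that $\gamma \to^{m} \gamma'$ for some $m$ with $0 \le m \sim k$.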


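Temporal Logic A run $\rho$ of a transition system $T = (\Gamma, \to)$ is an infinite word
$\gamma_0 \gamma_1 \ldots$ of configurations such that $\gamma_i \to \gamma_{i+1}$ for $i \ge 0$. We use $\rho[i]$ to denote
$\gamma_i$. We say that $\rho$ is a $\gamma$-run if $\rho[0] = \gamma$. We use $\mathrm{Runs}(\gamma)$ to denote the set of
$\gamma$-runs. A path $\pi$ is a finite prefix of a run, and a $\gamma$-path is a finite prefix of a
$\gamma$-run. We use the standard notation $\gamma \models_T \phi$ to represent that $\gamma$ satisfies the
CTL* state formula $\phi$ and $\rho \models_T \phi$ to mean that $\rho$ satisfies the path⁶ formula $\phi$.
We refer the reader to [38] for details of CTL*.

For $\gamma \in \Gamma$ and $G \subseteq \Gamma$, we say that $G$ is reachable from $\gamma$, denoted $\gamma \models_T \exists\Diamond G$,
if there is a $\gamma$-run $\rho$ such that $\rho[i] \in G$ for some $i$. For $k \in \mathbb{N}$, $\gamma \in \Gamma$, and $G \subseteq \Gamma$,
$\rho \models_T \Diamond^{=k} G$ says that $\rho$ reaches $G$ first at the $k$-th step. For $\sim \in \{<, \le, =, \ge, >\}$,
$\rho \models_T \Diamond^{\sim k} G$ says that $\rho \models_T \Diamond^{=m} G$ holds for some $m$ with $0 \le m \sim k$. The statement
$\rho \models_T \bigcirc^{k} G$ says that $\rho$ visits $G$ at the $k$-th step (but possibly earlier).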


Markov Chains A Markov chain $C$ is a pair $(\Gamma, M)$ where $\Gamma$ is a (potentially
infinite) set of configurations, and $M: \Gamma \times \Gamma \to [0,1]$ is a transition probability
matrix over $\Gamma$, called the probability matrix of $C$, i.e., $M$ satisfies:
$\forall a \in \Gamma.\ \sum_{b \in \Gamma} M(a, b) = 1$. A Markov chain $C = (\Gamma, M)$ induces an underlying transition
system, denoted $C^{\downarrow}$. We define $C^{\downarrow} := (\Gamma, \to)$, where $\to := \{(\gamma, \gamma') \mid M(\gamma, \gamma') > 0\}$.
The underlying transition system has the same configuration set, with transitions
between configurations that have non-zero transition probability under $C$.
This allows us to lift the temporal logic concepts defined above to Markov chains.


6 We term infinite sequences as runs and finite sequences as paths. However, tradi- 
tionally, CTL* refers to properties of infinite-sequences (our runs) as path-formulae. 
Probability Measures Consider a Markov chain $C = (\Gamma, M)$. The probability of
taking path $\pi$ is the product of single step probabilities along $\pi$:

$\mathrm{Prob}_C(\pi) := \prod_{i = 0, \ldots, |\pi| - 1} M(\pi[i], \pi[i+1])$

For a configuration $\gamma$, we adopt the usual probability space on $\gamma$-runs with the
$\sigma$-algebra over cylindrical sets starting from $\gamma$ (see [39,20] for details). For a path
formula $\phi$, we define $\mathrm{Prob}_C(\gamma \models \phi) := \mathrm{Prob}_C(\{\rho \in \mathrm{Runs}(\gamma) \mid \rho \models_C \phi\})$ (which
is measurable by [40]), e.g., given a set $F \subseteq \Gamma$, $\mathrm{Prob}_C(\gamma \models \Diamond F)$ is the measure
of $\gamma$-runs which reach $F$. If $\mathrm{Prob}_C(\gamma \models \phi) = 1$ then we say that almost all $\gamma$-runs
of $C$ satisfy $\phi$. Following the literature, we say that $\gamma \models_C \phi$ holds almost surely
(almost certainly), or that $\phi$ holds almost surely from $\gamma$.
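The definitions above can be tried out on a small finite chain. The following is a minimal sketch, assuming the probability matrix is given as a dictionary of rows; the function names and the two-state example chain are illustrative and do not come from the paper. The path probability is computed as the product over consecutive pairs of configurations in the path.

```python
from fractions import Fraction

def is_probability_matrix(M):
    """Check that every row of M sums to exactly 1."""
    return all(sum(row.values()) == 1 for row in M.values())

def underlying_transitions(M):
    """Transition relation of the underlying transition system:
    all pairs (a, b) with non-zero transition probability."""
    return {(a, b) for a, row in M.items() for b, pr in row.items() if pr > 0}

def path_probability(M, path):
    """Product of the single-step probabilities along a finite path."""
    prob = Fraction(1)
    for a, b in zip(path, path[1:]):
        prob *= M[a].get(b, Fraction(0))
    return prob

# Hypothetical two-state chain: "a" loops or moves to "b"; "b" is absorbing.
EXAMPLE = {
    "a": {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    "b": {"b": Fraction(1)},
}
```

Exact fractions are used instead of floats so that the row-sum check is not affected by rounding.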


3 Concurrent Programs 


A (concurrent) program consists of a set of processes that run in parallel and 
communicate through a set of shared variables. The operation of the program 
is controlled by a central scheduler that selects the processes to execute one 
after the other. We assume a finite set Procs of processes that share a set 
$\mathcal{X}$ of variables. Fig. 1 gives the grammar for a small but general assembly-
like language that we use for defining the syntax of concurrent programs. A
program instance, $P$, is described by a set of shared variables, var*, followed
by the codes of the processes, (proc reg* instr*)*. Each process $p \in \mathrm{Procs}$
has a finite set $\mathrm{Regs}_p$ of (local) registers. We assume that the sets of
registers of the different processes are disjoint, and define
$\mathrm{Regs}_P := \bigcup_{p \in \mathrm{Procs}} \mathrm{Regs}_p$. Each process declares its set of registers, reg*,
followed by a sequence of instructions. We assume that the data domain of $\mathcal{X}$
and $\mathrm{Regs}_P$ is a finite set $V$, with a special element $0 \in V$.

    prog  ::= var* (proc reg* instr*)*
    instr ::= lbl : stmt
    stmt  ::= var := reg
            | reg := var
            | reg := expr
            | reg := CAS(var, reg, reg)
            | if reg then lbl
            | term

Fig. 1. A simple programming language.

Instructions An instruction $i$ is of the form $l : s$ where $l$ is a
unique (across processes) label and $s$ is a statement. Labels represent program
counters of processes and indicate the instruction that the process executes the 
next time it is scheduled. A read/write statement either writes the value of a 
register to a shared variable, reads the value of a shared variable into a regis- 
ter, or updates the value of a register by evaluating an expression. We assume 
a set expr of expressions over constants and registers, but not referring to the 
shared variables. The CAS statement is the standard compare-and-swap oper- 
ation, and if-statements have their usual interpretations. Iterative constructs 
such as while and for, as well as goto-statements, can be encoded with branch- 
ing if-statements as usual. 
The fence statement, which flushes the contents of the buffer of the process,
can be simulated using the CAS statement. The statement term will cause the
process to terminate its execution. Sometimes, we will refer to an instruction by
its statement, e.g., we call the instruction r := x (where r is a register and x is
a shared variable) a read instruction, and similarly for a write instruction, etc.
The semantics of these instructions is explained through a set of inference rules
in Sec. 4.


Labels We define $\mathrm{Lbl}_p$ to be the set of labels that occur in the code of the
process $p$, and define $\mathrm{Lbl}_P := \bigcup_{p \in \mathrm{Procs}} \mathrm{Lbl}_p$. We assume that term has the label
$l^{\mathrm{term}}_p$. We define $\mathrm{Instr}_p$ to be the set of instructions occurring in $p$, and define
$\mathrm{Instr}_P := \bigcup_{p \in \mathrm{Procs}} \mathrm{Instr}_p$. For an instruction $i$ of the form $l : s$ we define $\lambda(i) := l$
and $\mathrm{stmt}(i) := s$. Abusing notation, we also define $\mathrm{stmt}(l) := s$. For a process
$p \in \mathrm{Procs}$ and an instruction $i \in \mathrm{Instr}_p$ with $\mathrm{stmt}(i) \neq \mathrm{term}$, we define $\mathrm{next}(i)$ to
be the (unique) instruction next to $i$ in the code of $p$. For an instruction
$l_1 : (\mathrm{if}\ a\ \mathrm{then}\ l_2)$, we assume, without loss of generality⁷, that $l_1 \neq l_2$.


Scheduler The scheduler selects the process from Procs to run next. The
operational model for classical TSO [41] uses a non-deterministic scheduler. We adopt
a scheduler that selects the next process probabilistically. The scheduler policy
is defined by a function Sched, where $\mathrm{Sched}(p) \in \mathbb{N}$ denotes the scheduling weight
assigned to the process $p$. If $p$ is enabled (i.e., the process can execute the next
instruction, formally defined in Sec. 4) then $p$ is scheduled at the next step with
a probability that is proportional to $\mathrm{Sched}(p)$.


4 Operational Semantics 


The operational model for classical TSO [41] describes the semantics as a
transition system. We also take an operational approach. However, we differ in a
fundamental aspect: classical TSO models the choice between transitions as
non-deterministic choice. We, on the other hand, model this as probabilistic choice, to
get a system called Probabilistic TSO (PTSO for short). Adding probabilities
induces a Markov chain, which governs the behaviours of PTSO.

A program is described by a pair: the set of processes, Procs, and the scheduler
policy Sched. In this section, we fix such a program $P = (\mathrm{Procs}, \mathrm{Sched})$. We
develop the operational semantics of $P$ under PTSO as an infinite-state Markov
chain $[\![P]\!]^{\mathrm{PTSO}} := (\Gamma_P, M_P)$. We begin by defining the set of configurations $\Gamma_P$
(Sec. 4.1). Then we describe the behavior of $P$ under classical TSO using a
transition system $[\![P]\!]^{\mathrm{TSO}}$ (Sec. 4.2). Finally, we extend the transition system to a
Markov chain $[\![P]\!]^{\mathrm{PTSO}}$ by giving probability distributions that govern process
scheduling and memory updates.


7 We make this restriction for technical convenience. The case where $l_1 = l_2$ does not
introduce conceptual difficulties. However, the restriction simplifies the presentation
by eliminating some corner cases when we define probability measures (Sec. 5) and
when we introduce our cost model (Sec. 8).
4.1 Configurations 


The central feature of TSO is the store buffer: a FIFO buffer in which pending 
write operations are queued as messages. The semantics equips each process 
p € Procs with an unbounded buffer, here called the p-buffer, that carries pending 
write operations issued by p, but that have yet not reached the shared memory. 

A configuration, $\langle \lambda, R, B, M \rangle$, describes four attributes: a labeling state ($\lambda$),
a register state ($R$), a buffer state ($B$), and a memory state ($M$). We use $\Gamma_P$ to
denote the set of configurations of $P$.

A labeling state is a function $\lambda: \mathrm{Procs} \to \mathrm{Lbl}_P$ that defines, for $p \in \mathrm{Procs}$, the
label $\lambda(p) \in \mathrm{Lbl}_p$ of the next instruction to be executed by $p$.

A register state is a function $R: \mathrm{Regs}_P \to V$ that maps each register
$a \in \mathrm{Regs}_P$ to its current value $R(a) \in V$. For an expression $e$, we use $R(e)$ to denote
the evaluation of $e$ against the register state $R$.

A single-buffer state $w$ is a word in $(\mathcal{X} \times V)^*$, describing the content of the
$p$-buffer for some process $p \in \mathrm{Procs}$. The buffer contains a sequence of pending
write messages, i.e., pairs of the form $(x, v)$ representing a write to $x$ with value $v$.
A buffer state is a function $B: \mathrm{Procs} \to (\mathcal{X} \times V)^*$ that defines, for each process
$p \in \mathrm{Procs}$, a single-buffer state describing the content of the $p$-buffer.

A memory state is a function $M: \mathcal{X} \to V$ that assigns to each variable $x \in \mathcal{X}$
its current value $M(x) \in V$ in the shared memory.


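Process transitions (for a process $p$):

  write
    premises:   $\mathrm{stmt}(\lambda(p)) = (x := a)$;  $B' = B[p \leftarrow (x, R(a)) \cdot B(p)]$;  $\lambda' = \lambda[p \leftarrow \mathrm{next}(\lambda(p))]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R, B', M \rangle$

  read
    premises:   $\mathrm{stmt}(\lambda(p)) = (a := x)$;  $\mathrm{FetchVal}(x)(B(p))(M) = v$;  $R' = R[a \leftarrow v]$;  $\lambda' = \lambda[p \leftarrow \mathrm{next}(\lambda(p))]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R', B, M \rangle$

  expr
    premises:   $\mathrm{stmt}(\lambda(p)) = (a := e)$;  $R' = R[a \leftarrow R(e)]$;  $\lambda' = \lambda[p \leftarrow \mathrm{next}(\lambda(p))]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R', B, M \rangle$

  CAS-true
    premises:   $\mathrm{stmt}(\lambda(p)) = (b := \mathrm{CAS}(x, a_1, a_2))$;  $M(x) = R(a_1)$;  $B(p) = \epsilon$;
                $R' = R[b \leftarrow \mathrm{true}]$;  $M' = M[x \leftarrow R(a_2)]$;  $\lambda' = \lambda[p \leftarrow \mathrm{next}(\lambda(p))]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R', B, M' \rangle$

  CAS-false
    premises:   $\mathrm{stmt}(\lambda(p)) = (b := \mathrm{CAS}(x, a_1, a_2))$;  $M(x) \neq R(a_1)$;  $B(p) = \epsilon$;
                $R' = R[b \leftarrow \mathrm{false}]$;  $\lambda' = \lambda[p \leftarrow \mathrm{next}(\lambda(p))]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R', B, M \rangle$

  if-true
    premises:   $\mathrm{stmt}(\lambda(p)) = (\mathrm{if}\ a\ \mathrm{then}\ l)$;  $R(a) = \mathrm{true}$;  $\lambda' = \lambda[p \leftarrow l]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R, B, M \rangle$

  if-false
    premises:   $\mathrm{stmt}(\lambda(p)) = (\mathrm{if}\ a\ \mathrm{then}\ l)$;  $R(a) = \mathrm{false}$;  $\lambda' = \lambda[p \leftarrow \mathrm{next}(\lambda(p))]$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R, B, M \rangle$

  proc
    premise:    $\langle \lambda, R, B, M \rangle \to^{p}_{\mathrm{proc}} \langle \lambda', R', B', M' \rangle$
    conclusion: $\langle \lambda, R, B, M \rangle \to_{\mathrm{proc}} \langle \lambda', R', B', M' \rangle$

  disabled
    premise:    $\gamma$ is disabled
    conclusion: $\gamma \to_{\mathrm{proc}} \gamma$

Update transitions:

  empty-update
    conclusion: $\langle \lambda, R, B, M \rangle \to^{\epsilon}_{\mathrm{update}} \langle \lambda, R, B, M \rangle$

  single-update
    premises:   $B'(p) = w \cdot (x, v)$;  $B'' = B'[p \leftarrow w]$;  $M'' = M'[x \leftarrow v]$
    conclusion: $\langle \lambda, R, B', M' \rangle \to^{p}_{\mathrm{update}} \langle \lambda, R, B'', M'' \rangle$

  update
    premises:   $\langle \lambda, R, B, M \rangle \to^{\alpha}_{\mathrm{update}} \langle \lambda, R, B', M' \rangle$;  $\langle \lambda, R, B', M' \rangle \to^{p}_{\mathrm{update}} \langle \lambda, R, B'', M'' \rangle$
    conclusion: $\langle \lambda, R, B, M \rangle \to^{\alpha \cdot p}_{\mathrm{update}} \langle \lambda, R, B'', M'' \rangle$

Overall transition:

  Full-TSO
    premises:   $\langle \lambda, R, B, M \rangle \to_{\mathrm{proc}} \langle \lambda', R', B', M' \rangle$;  $\langle \lambda', R', B', M' \rangle \to_{\mathrm{update}} \langle \lambda'', R'', B'', M'' \rangle$
    conclusion: $\langle \lambda, R, B, M \rangle \to_{P} \langle \lambda'', R'', B'', M'' \rangle$

Fig. 2. The classical TSO semantics: process transitions (green), update transitions
(orange) and overall transition (Full-TSO)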


Consider a configuration $\gamma = \langle \lambda, R, B, M \rangle$. We say that $\gamma$ is plain if $B(p) = \epsilon$
for all $p \in \mathrm{Procs}$, i.e., all the buffers in $\gamma$ are empty. We use $\Gamma_P^{\mathrm{plain}}$ to denote
the set of plain configurations of $P$. Notice that $\Gamma_P^{\mathrm{plain}} \subseteq \Gamma_P$ and that $\Gamma_P^{\mathrm{plain}}$ is
finite. For a label $l \in \mathrm{Lbl}_P$, we write $l \in \gamma$ if $\lambda(p) = l$ for some $p \in \mathrm{Procs}$. We
define $\Gamma_P^{l} := \{\gamma \in \Gamma_P \mid l \in \gamma\}$, i.e., configurations in which $l$ occurs.

For a configuration $\gamma = \langle \lambda, R, B, M \rangle$ we define the size of $\gamma$ by
$|\gamma| := \sum_{p \in \mathrm{Procs}} |B(p)|$, i.e., it is the total number of messages in the buffers in $\gamma$. For
$\sim \in \{<, \le, =, \ge, >\}$, we define $\Gamma_P^{\sim \ell} := \{\gamma \in \Gamma_P \mid |\gamma| \sim \ell\}$, i.e., configurations
where the total number of messages, $m$, relates to $\ell$ by $m \sim \ell$.


4.2 The Classical TSO Semantics 


We recall the classical semantics of TSO, using a transition system
$[\![P]\!]^{\mathrm{TSO}} = (\Gamma_P, \to_P)$. We define the transition relation $\to_P$ through the set of inference
rules in Fig. 2. The relation $\to_P$ is the composition of two relations: the relation
$\to_{\mathrm{proc}}$ describes the processes' execution steps, and the relation $\to_{\mathrm{update}}$ describes
memory updates, where pending writes are propagated to the memory.


Process Transitions We define the process transition relation
$\to_{\mathrm{proc}} := \bigcup_{p \in \mathrm{Procs}} \to^{p}_{\mathrm{proc}}$ as a union of relations each corresponding to one process (the rule
proc). The inference rules defining $\to^{p}_{\mathrm{proc}}$, for a process $p \in \mathrm{Procs}$, are depicted
in Fig. 2. Each rule corresponds to one step performed by $p$. After executing an
instruction, $p$ will move on to the next instruction in its code. It executes the
latter instruction when again selected by the scheduler.

A write instruction $(x := a)$ assigns the value of the local register $a$ to the
shared variable $x$. The process appends a write message, consisting of $x$ together
with the value $R(a)$ of $a$, to the head of the $p$-buffer. A read instruction, $(a := x)$,
assigns the value of the shared variable $x$ to the local register $a$. The value of $x$
is either fetched from the $p$-buffer (read-own-write), or from the shared memory
(read-from-memory). We capture both cases in one inference rule, using the
function FetchVal defined as follows. Let $w$ be the contents of the $p$-buffer. We
write $x \in w$ if $(x, v) \in w$ for some $v \in V$, and write $x \notin w$ otherwise. We define
(i) $\mathrm{FetchVal}(x)(w)(M) := v$ if $x \in w$ and $w = w_1 \cdot (x, v) \cdot w_2$ with $x \notin w_1$; and
(ii) $\mathrm{FetchVal}(x)(w)(M) := M(x)$ if $x \notin w$. In case (i), the value of $x$
is taken from the latest $x$-message in the $p$-buffer. In case (ii), no $x$-messages
exist in the $p$-buffer, and the value is read from the shared memory.
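The two cases of FetchVal can be sketched in a few lines. This is only an illustration under assumed encodings that are not in the paper: a buffer is a tuple of (variable, value) pairs whose first element is the head (the most recently issued write), and the memory is a dictionary.

```python
def fetch_val(x, w, memory):
    """Value read for variable x given buffer contents w and the shared memory.

    w is a sequence of (variable, value) pairs with the head (newest write) first,
    matching the write rule, which adds new messages at the head of the buffer.
    """
    for var, val in w:
        if var == x:
            # Case (i): the first x-message from the head is the latest write to x.
            return val
    # Case (ii): no x-message is pending, so the value comes from memory.
    return memory[x]
```

For instance, with a buffer `(("x", 2), ("y", 5), ("x", 1))` the read of `x` returns the newer value 2, while a variable with no pending message is read from memory.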

The instruction $b := \mathrm{CAS}(x, a_1, a_2)$ checks whether the $p$-buffer is empty and
the value of the shared variable $x$ is equal to the value of the register $a_1$. If
yes, we assign atomically the value of the register $a_2$ to $x$, and assign the value
true to $b$ (the rule CAS-true). If the value of $x$ is different from the value of
$a_1$ then we do not change the value of $x$, but assign the value false to $b$ (the
rule CAS-false). If the $p$-buffer is not empty then $p$ is disabled in the current
configuration. We define the set of disabled processes at configuration $\gamma$:

$\mathrm{disab}(\gamma) := \{p \mid (\mathrm{stmt}(\lambda(p)) = \mathrm{term}) \vee ((\mathrm{stmt}(\lambda(p)) = (b := \mathrm{CAS}(x, a_1, a_2))) \wedge (B(p) \neq \epsilon))\}$


In other words, it is the set of processes that are disabled in $\gamma$ either because
they have terminated or because they are about to perform a CAS operation
and their buffers are not empty. We say that $p$ is disabled in $\gamma$ if $p \in \mathrm{disab}(\gamma)$,
and that $\gamma$ is disabled if all the processes are disabled in $\gamma$. If a process (resp.
configuration) is not disabled then it is enabled. If $\gamma$ is disabled, we make a
dummy transition that does not change $\gamma$ (the rule disabled)⁸. Notice that if
$\gamma \to_{\mathrm{proc}} \gamma'$ then there is a unique process $p \in \mathrm{Procs}$ such that $\gamma \to^{p}_{\mathrm{proc}} \gamma'$.


Update Transitions Between two process transitions, the system may perform
a (possibly empty) sequence of update steps. The rule empty-update describes
an empty update step. Each single-update step pops one write message at the
end of the $p$-buffer for some process $p$ and uses it to update the memory. The
update rule captures the effect of a sequence of such single-update steps. We
define the update transition relation $\to_{\mathrm{update}} := \bigcup_{\alpha \in \mathrm{Procs}^*} \to^{\alpha}_{\mathrm{update}}$ as a union of
relations each corresponding to a given sequence of update steps. The word $\alpha$
gives the sequence of processes that perform the updates. The net effect is that
the system (i) pops a sequence of (possibly empty) suffixes from the buffer of
each process, (ii) shuffles these into one sequence, and (iii) uses the resulting
sequence to update the memory. Notice that each selection of possible suffixes
in step (i) may result in several different sequences due to multiple interleavings
in step (ii). Observe that $\to_P$ is deadlock-free, i.e., for each configuration $\gamma \in \Gamma_P$,
there is at least one configuration $\gamma' \in \Gamma_P$ such that $\gamma \to_P \gamma'$.


4.3 Adding Probabilities: PTSO 


We define the Markov chain $[\![P]\!]^{\mathrm{PTSO}} = (\Gamma_P, M_P)$. The set $\Gamma_P$ of configurations
is defined as above. The probability matrix $M_P$ is defined as the composition of
two probability distributions: (i) the process probability distribution $M_{\mathrm{proc}}$ and (ii)
the update probability distribution $M_{\mathrm{update}}$, which add probabilities to the process
transition relation $\to_{\mathrm{proc}}$ and the update transition relation $\to_{\mathrm{update}}$, respectively.


The Process Probability Distribution: the Scheduler At each program
step ($\to_P$), a process is selected for execution according to a probability given
by the scheduler. In a configuration $\gamma$, the scheduler selects an enabled process
$p \in \mathrm{enab}(\gamma)$ with a probability that reflects the relative weight of $p$ compared
to those of the other enabled processes, $\mathrm{Rweight}(\gamma)(p)$:

$\mathrm{Rweight}(\gamma)(p) := \begin{cases} 0 & \text{if } p \in \mathrm{disab}(\gamma) \\ \dfrac{\mathrm{Sched}(p)}{\sum_{p' \in \mathrm{enab}(\gamma)} \mathrm{Sched}(p')} & \text{if } p \in \mathrm{enab}(\gamma) \end{cases}$    (1)

This gives the probability that $p$ executes in the next step from $\gamma$. For
configurations $\gamma$ and $\gamma'$ with $\gamma \to^{p}_{\mathrm{proc}} \gamma'$, we define $M_{\mathrm{proc}}(\gamma, \gamma') := \mathrm{Rweight}(\gamma)(p)$.
In other words, we move from $\gamma$ to $\gamma'$ with a probability that is given by the
relative weight of $p$ in $\gamma$. We define $M_{\mathrm{proc}}(\gamma, \gamma') := 0$ if $\gamma \not\to_{\mathrm{proc}} \gamma'$. To account for
the case where all the processes are disabled in $\gamma$, we define $M_{\mathrm{proc}}(\gamma, \gamma) := 1$ if $\gamma$
is disabled.

8 The latter transition is not strictly needed, but it is included for technical
convenience.
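Equation (1) can be illustrated with a short sketch. The representation is an assumption made here for illustration: scheduling weights are a dictionary of positive integers, and the set of enabled processes of the configuration is passed in explicitly rather than derived from a configuration.

```python
from fractions import Fraction

def rweight(sched, enabled, p):
    """Relative weight of process p among the enabled processes (Equation (1)).

    sched:   dict mapping each process to a positive integer weight (assumed > 0)
    enabled: set of processes that are enabled in the current configuration
    """
    if p not in enabled:
        # Disabled processes are never scheduled.
        return Fraction(0)
    return Fraction(sched[p], sum(sched[q] for q in enabled))
```

With weights `{"p": 1, "q": 3}` and both processes enabled, `p` is scheduled with probability 1/4; if only `q` is enabled, it is scheduled with probability 1.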


Faithfulness Our model uses a scheduling policy that assigns a fixed scheduling 
weight, Sched (p), to each process p in the system. This is a case of memoryless 
scheduling, i.e., the probability distribution over processes does not depend on 
the execution history. However, we can relax this constraint to allow for any 
scheduling policy that satisfies the faithfulness condition: 


$\forall p \in \mathrm{Procs}:\ \mathrm{Rweight}(\gamma)(p) = 0 \iff p \in \mathrm{disab}(\gamma)$


In words, at each step, each enabled process should be scheduled with non-zero 
probability. A scheduler that assigns scheduling weights such that the above 
condition holds is said to be a faithful scheduler. 


Schedulers with memory The above criterion allows for schedulers that are more
refined than the memoryless scheduler. As an example, on implementations
of TSO, processes are often scheduled for multiple consecutive steps
since unnecessary context switching wastes processor resources. To reflect this
detail, we can consider a scheduler that assigns a higher probability to the
previously scheduled process, $p_{\mathrm{prv}}$. For some choice of constant weights, Sched, we
can define a new choice of weights Sched', where $\lambda > 1$ is some parameter:

$\mathrm{Sched}'(p) := \mathrm{Sched}(p)$ if $p \neq p_{\mathrm{prv}}$, and $\lambda \cdot \mathrm{Sched}(p)$ otherwise.

In this case, $p_{\mathrm{prv}}$ is re-scheduled with a weight which is larger by a factor of $\lambda$.
A larger $\lambda$ implies a stronger tendency to re-schedule a process. This scheduling
policy still satisfies faithfulness. One can extend this by formulating more
intricate policies, e.g., ones that account for the $k$ previous steps.
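A scheduler of this kind, together with a faithfulness check, might look as follows. This is a sketch under stated assumptions: the names `sched_with_memory`, `schedule_probabilities` and `is_faithful` are hypothetical, weights are positive integers, the boost parameter `lam` is an integer greater than 1, and the set of enabled processes is supplied by the caller.

```python
from fractions import Fraction

def sched_with_memory(sched, prev, lam):
    """Weights in which the previously scheduled process `prev` is boosted by `lam`."""
    return {p: (lam * w if p == prev else w) for p, w in sched.items()}

def schedule_probabilities(weights, enabled):
    """Scheduling probability of every process, normalised over the enabled ones."""
    total = sum(weights[p] for p in enabled)
    return {p: (Fraction(weights[p], total) if p in enabled else Fraction(0))
            for p in weights}

def is_faithful(probabilities, enabled):
    """Faithfulness: a process has probability 0 exactly when it is not enabled."""
    return all((pr == 0) == (p not in enabled) for p, pr in probabilities.items())
```

Since every boosted weight remains positive, the resulting distribution gives non-zero probability to each enabled process, which is what the faithfulness condition asks for.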

To better illustrate the concerns and challenges of verification, we continue 
to adopt the simple (memoryless) scheduler proposed earlier. However, we em- 
phasize that our results extend to faithful schedulers. 

The Update Probability Distribution: the Memory update policy Between
the process steps, pending messages from the store buffers are propagated
to the shared memory (the update transition). The details of this write
propagation are implementation-specific, with policies tuned towards system
performance. Classical TSO models this update propagation non-deterministically. We,
on the other hand, consider a probabilistic update policy. In a similar manner
to the scheduling probabilities, the update probability distribution defines the
probability by which a configuration $\gamma$ reaches another configuration $\gamma'$ through
an update step ($\to_{\mathrm{update}}$). Recall that an update step consists of a sequence of
(single) update operations. The number of possible update sequences from $\gamma$ is
finite since the size of each buffer is finite. In our model, we assume that the
update distribution is the uniform distribution over all possible update sequences.
We note that starting from $\gamma$, different update sequences can lead to the same
configuration $\gamma'$. The reason is that different shufflings of the selected suffixes
(see Sec. 4.2) may lead to the same memory state. To reflect this, for
configurations $\gamma$ and $\gamma'$, we define

$M_{\mathrm{update}}(\gamma, \gamma') := \dfrac{|\{\alpha \mid \gamma \to^{\alpha}_{\mathrm{update}} \gamma'\}|}{|\{\alpha \mid \exists \gamma''.\ \gamma \to^{\alpha}_{\mathrm{update}} \gamma''\}|}$,

i.e., the fraction of update sequences that lead to the configuration $\gamma'$.
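The uniform update policy can be made concrete by enumerating update sequences explicitly. The sketch below is illustrative only and rests on assumed encodings that are not from the paper: buffers are a dictionary from process names to tuples of (variable, value) messages with the newest message first (so the last element is the next one to be popped), and memory is a dictionary. Labels and registers are omitted because update steps do not change them.

```python
from collections import Counter
from fractions import Fraction

def update_sequences(buffers, memory):
    """Yield (alpha, buffers', memory') for every update sequence alpha."""
    yield (), buffers, memory                      # the empty update
    for p, w in buffers.items():
        if w:                                      # p can perform a single update
            x, v = w[-1]                           # pop the message at the end
            popped = {**buffers, p: w[:-1]}
            written = {**memory, x: v}
            for alpha, b, m in update_sequences(popped, written):
                yield (p,) + alpha, b, m

def freeze(buffers, memory):
    """Hashable representation of the buffer and memory state."""
    return (tuple(sorted(buffers.items())), tuple(sorted(memory.items())))

def update_distribution(buffers, memory):
    """Fraction of update sequences leading to each resulting state."""
    counts = Counter(freeze(b, m) for _, b, m in update_sequences(buffers, memory))
    total = sum(counts.values())
    return {state: Fraction(n, total) for state, n in counts.items()}
```

With two processes that each hold one pending write of the same value to `x`, there are five update sequences (the empty one, `p`, `q`, `pq`, `qp`). The last two lead to the same state, which therefore receives probability 2/5. The enumeration is exponential in general and is meant only to mirror the definition.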


Left-Biasedness Though we adopt a specific update distribution, we provide a
generic condition on the update policy that is sufficient for our results to hold.
We call this the left-biasedness property. Here we provide an intuitive description
of left-biasedness and defer the formal definition to Sec. 8.

Intuitively, left-biasedness requires that for sufficiently large configurations,
the probability that the configuration size reduces in a single $\to_P$ step is strictly
greater than $p$ for some $p > \frac{1}{2}$. Left-biasedness allows a wide class of more refined
update policies, e.g., where no message propagation is performed when the
number of messages is smaller than a certain value, or where only the messages
inside the buffers of some (probabilistically selected) processes are propagated.

Though our results apply more generally to models characterized by
faithfulness (scheduler policy) and left-biasedness (update policy), we continue to
adopt the fixed-weight (memoryless) scheduler and uniform update policy for
reasons described above.

The Full Probability Distribution. We combine the process and update
probability distributions to derive the probability matrix $M_P$, and thus obtain
the Markov chain $[\![P]\!]^{\mathrm{PTSO}}$. Consider configurations $\gamma$ and $\gamma'$ where $\gamma \to_P \gamma'$. Let
$\gamma''$ be the unique configuration such that $\gamma \to_{\mathrm{proc}} \gamma'' \to_{\mathrm{update}} \gamma'$. Then, we define

$M_P(\gamma, \gamma') := M_{\mathrm{proc}}(\gamma, \gamma'') \cdot M_{\mathrm{update}}(\gamma'', \gamma')$.


Lemma 1 $M_P$ is a probability distribution on $\Gamma_P$; hence, $[\![P]\!]^{\mathrm{PTSO}}$ is a Markov chain.


5 PTSO: Concepts and Properties 


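We now give some intuition for the concepts underlying Probabilistic TSO and its
properties.


PTSO Refines Classical TSO. After introducing $[\![P]\!]^{\mathrm{TSO}}$ and $[\![P]\!]^{\mathrm{PTSO}}$ in Sec. 4,
we note that they are closely related: $[\![P]\!]^{\mathrm{TSO}}$ is the underlying transition system of
$[\![P]\!]^{\mathrm{PTSO}}$.


Lemma 2 $([\![P]\!]^{\mathrm{PTSO}})^{\downarrow} = [\![P]\!]^{\mathrm{TSO}}$ for any program $P$.


In particular, this means that the PTSO system $[\![P]\!]^{\mathrm{PTSO}}$ is a refinement of
$[\![P]\!]^{\mathrm{TSO}}$: a behaviour is observed in $[\![P]\!]^{\mathrm{TSO}}$ iff it is seen in $[\![P]\!]^{\mathrm{PTSO}}$ with non-zero
probability. Whenever the context is clear, we write $P$ instead of $[\![P]\!]^{\mathrm{TSO}}$, $[\![P]\!]^{\mathrm{PTSO}}$.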

Label Reachability. We formulate our verification problems in terms of
reachability to instruction labels. To simplify the notation, we identify a label $l \in \mathrm{Lbl}_P$
with the set $\Gamma_P^{l}$ of configurations in which $l$ occurs. We say that "$l$ is reachable"
rather than "$\Gamma_P^{l}$ is reachable", and write $\Diamond l$ instead of $\Diamond \{\gamma \in \Gamma_P \mid l \in \gamma\}$. In
[13,12] the authors show that label reachability from a plain configuration is
decidable. The following lemma generalizes this to the case where the source
configuration need not be plain and the destination can be a particular plain
configuration.


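Lemma 3 For a program $P$, a configuration $\gamma \in \Gamma_P$, and a plain configuration
$\gamma' \in \Gamma_P^{\mathrm{plain}}$, it is decidable whether $\gamma \to^*_P \gamma'$.


Extending this, we have Lemma 4: we can query whether $\gamma \to^*_P \gamma'$ for each
$\gamma' \in \Gamma_P^{l} \cap \Gamma_P^{\mathrm{plain}}$. Decidability in Lemma 4 follows since $\Gamma_P^{\mathrm{plain}}$ is finite and the
subroutine is decidable by Lemma 3.


Lemma 4 For a program $P$, a configuration $\gamma \in \Gamma_P$, and a label $l \in \mathrm{Lbl}_P$, it is
decidable whether $\gamma \to^*_P l$.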

5.1 Left-Orientedness and Attractors 


We show that the set of plain configurations $\Gamma_P^{\mathrm{plain}}$ has an attractor property
in the sense of [16]. In our setting, this means that any run of $[\![P]\!]^{\mathrm{PTSO}}$ almost surely
visits $\Gamma_P^{\mathrm{plain}}$ infinitely often.


Small and large configurations To arrive at this result, we consider a
generalization of plain configurations, called small configurations, denoted $\Gamma_P^{\mathrm{small}}$, which
consists of configurations with a small number of messages inside their buffers.
Concretely, a configuration $\gamma$ is small if $|\gamma| \le 4$, i.e., the total number of messages
inside the buffers does not exceed 4.⁹ We define the set of large configurations
by $\Gamma_P^{\mathrm{large}} := \Gamma_P - \Gamma_P^{\mathrm{small}} = \Gamma_P^{>4}$. We show that the Markov chain $[\![P]\!]^{\mathrm{PTSO}}$ is
left-oriented in the sense of [42]. That is, for any large configuration $\gamma \in \Gamma_P^{\mathrm{large}}$, the
expected change in configuration size for a single $\to_P$ step is negative.


An illustrative example We explain the update probability distribution through
the following code snippet.

    procL              procR
    0: x = 1           2: a = x
    1: goto 0          3: goto 2

To begin with, let us only consider the process on the left (procL). It executes an
infinite loop, writing 1 to variable x. Let us consider the evolution of the buffer
size of procL, i.e., the number of (x,1) messages in the procL-buffer. Assume that
on reaching label 0, procL has 6 messages in its buffer. The $\to_P$ step consists of a
process transition, $\to_{\mathrm{proc}}$, followed by an update transition, $\to_{\mathrm{update}}$. In the
$\to_{\mathrm{proc}}$ step, the write increases the size of the buffer by one, thus obtaining a
buffer of size 7. Following this, the update step may push any number of messages
to the memory. Since the update policy chooses uniformly amongst possible update
sequences, the resulting configuration has one amongst {0,...,7} messages in the
procL-buffer, each occurring with an equal probability of 1/8. The next $\to_{\mathrm{proc}}$
step (a goto) does not change the buffer size, but the update step can still
propagate messages. The reasoning for the next steps follows similarly.
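The arithmetic of this example can be checked with a small sketch. It covers only the single-process situation described above (procL alone, one write followed by a uniformly chosen update sequence); the function names are illustrative, and the calculation is not the general left-orientedness argument of the paper.

```python
from fractions import Fraction

def size_distribution_after_write(n):
    """Buffer-size distribution after a write from a buffer of n messages.

    The write yields n + 1 messages; with a single process there is exactly one
    update sequence per number of popped messages, so sizes 0..n+1 are equally likely.
    """
    return {k: Fraction(1, n + 2) for k in range(n + 2)}

def expected_change(n):
    """Expected change in buffer size over one write step followed by an update."""
    dist = size_distribution_after_write(n)
    return sum(k * pr for k, pr in dist.items()) - n
```

Starting from 6 messages, each size in {0,...,7} has probability 1/8 and the expected size afterwards is 7/2, i.e., an expected change of -5/2. In this single-process setting the expected change is (1 - n)/2, which is negative whenever n > 1.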


° This value is an artifact of the probabilistic policies we have adopted in Sec. 4 
Comparison with other notions of fairness At each $\to_{\mathrm{proc}}$ step at most one
message is added to the process buffers (when the process performs a write); however,
the following $\to_{\mathrm{update}}$ step can still remove a large number of messages. Hence, from
sufficiently large configurations, the system has a tendency to move towards
configurations with smaller buffer sizes. Formally, we prove the following lemma,
using the left-orientedness property mentioned earlier.


Lemma 5 $\mathrm{Prob}_P(\gamma \models \Box\Diamond \Gamma_P^{\mathrm{plain}}) = 1$ for all configurations $\gamma \in \Gamma_P$.


For the above example, PTSO guarantees that the process on the right
(procR) eventually reads value 1 into register a. This follows since in a plain
configuration, the buffer of procR is empty and hence it can read the value from
the memory, and this happens almost surely. We highlight that other notions of
fairness, such as strong fairness in process scheduling (discussed in [27]) as well
as memory fairness [26], cannot provide this guarantee. In particular, memory
fairness from [26] would consider the execution which exactly alternates writes of
both processes, but in which procR reads before its own write is pushed to memory,
to be fair and hence permissible.


    x=1   x=2   a=x // 2   x=1   x=2   a=x // 2   x=1


B-Plain Configurations We can refine our analysis of the attraction property
enjoyed by the set $\Gamma_P^{\mathrm{plain}}$ of plain configurations. We consider a subset of $\Gamma_P^{\mathrm{plain}}$
which we call the set of bottom plain configurations (or B-plain configurations,
for short), denoted $\Gamma_P^{\mathrm{Bplain}}$. Intuitively, a B-plain configuration is a member of
a bottom strongly connected component in the graph of plain configurations.
Formally, a configuration $\gamma \in \Gamma_P$ is said to be B-plain if (i) $\gamma \in \Gamma_P^{\mathrm{plain}}$, and
(ii) for any $\gamma' \in \Gamma_P^{\mathrm{plain}}$, if $\gamma \to^*_P \gamma'$ then $\gamma' \to^*_P \gamma$. Since any run of the system
almost surely visits the set $\Gamma_P^{\mathrm{plain}}$ infinitely often, it will also almost surely
visit a B-plain configuration infinitely often.


Lemma 6 $\mathrm{Prob}_P(\gamma \models \Box\Diamond \Gamma_P^{\mathrm{Bplain}}) = 1$ for all configurations $\gamma \in \Gamma_P$.


6 Qualitative (Repeated) Reachability 


Given: a program $P$, a configuration $\gamma_{\mathrm{init}} \in \Gamma_P$, a label $l \in \mathrm{Lbl}_P$


QUAL_REACH: Determine whether $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) = 1$


QUAL_REP_REACH: Determine whether $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Box\Diamond l) = 1$


In this section, we perform qualitative reachability analysis for PTSO. Given a
program $P$, configuration $\gamma_{\mathrm{init}}$, and label $l$, we check whether a $\gamma_{\mathrm{init}}$-run almost
surely reaches $l$. We also consider qualitative repeated reachability, where we
ask whether a $\gamma_{\mathrm{init}}$-run repeatedly visits $l$ (visits $l$ infinitely often) with
probability 1. We also consider almost-never variants of the problems, where we
check whether the probabilities are 0 rather than 1. We prove that these problems
are decidable, and have non-primitive-recursive complexities.
6.1 Almost-Sure Reachability 


The qualitative reachability problem, QUAL_REACH, is defined above. The
algorithm in Figure 3 solves QUAL_REACH by analyzing the transition system
$[\![P]\!]^{\mathrm{TSO}}$, the underlying transition system of PTSO. If $l$ occurs in $\gamma_{\mathrm{init}}$ then
the property trivially holds, and hence we answer positively. Otherwise, the
algorithm considers a new program $P'$ obtained by replacing the statement
labeled $l$ by a new statement that makes $P'$ terminate immediately if $l$ is
reached. Let $p \in \mathrm{Procs}$ be the unique process such that $l \in \mathrm{Lbl}_p$. We define
$P \ominus l := (\mathrm{Procs} - \{p\} \cup \{p'\}, \mathrm{Sched})$ where $p'$ is a fresh process derived from
$p$ by replacing $\mathrm{stmt}(l)$ by goto $l^{\mathrm{term}}_{\mathrm{new}}$ for a fresh label $l^{\mathrm{term}}_{\mathrm{new}} \notin \mathrm{Lbl}_P$ and
adding a term at label $l^{\mathrm{term}}_{\mathrm{new}}$. The remaining instructions of $p'$ are identical to $p$.

The loop on line 3 cycles through the (finite) set of plain configurations.
For each plain configuration $\gamma$ from the original program $P$, we check: (i)
whether $\gamma$ is reachable from the initial configuration $\gamma_{\mathrm{init}}$ in $P'$. By the
construction of $P'$, this is equivalent to checking whether $\gamma$ is reachable from
$\gamma_{\mathrm{init}}$ in $P$ without observing label $l$. (ii) Whether it can reach the label $l$. If the
answer to (i) is yes, and the answer to (ii) is no, then we have found a finite path
$\pi$ in $P$ that, starting from $\gamma_{\mathrm{init}}$ and without visiting $l$, reaches a configuration $\gamma$
from which $l$ is not reachable. This implies that $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) < 1$. If none of
the plain configurations satisfy the condition, then each plain configuration $\gamma$
reachable from $\gamma_{\mathrm{init}}$ has a path to $l$. Now by the attractor lemma, any run will
almost surely visit $\Gamma_P^{\mathrm{plain}}$ infinitely often and, by the fairness property of Markov
chains, it almost surely visits $l$.

    Algorithm: QUAL_REACH
    Input: $P$: program; $\gamma_{\mathrm{init}} \in \Gamma_P$: configuration; $l \in \mathrm{Lbl}_P$: label.
    1 if $l \in \gamma_{\mathrm{init}}$ then return true;
    2 $P' := P \ominus l$;
    3 for each $\gamma \in \Gamma_P^{\mathrm{plain}}$ do
    4     if $\gamma_{\mathrm{init}} \to^*_{P'} \gamma$ and $\neg(\gamma \to^*_P l)$ then return false;
    5 return true

Fig. 3. Almost-sure reachability algorithm.
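The control structure of the algorithm can be sketched on a toy finite graph. This is not the decision procedure of the paper: the actual algorithm works on the infinite-state system and relies on the reachability procedures of Lemmas 3 and 4, whereas here plain breadth-first search over an explicit successor map stands in for those oracles, blocking expansion at the target plays the role of the transformed program, and the caller supplies the attractor set. All names are illustrative.

```python
from collections import deque

def reachable(succ, source, blocked=frozenset()):
    """States reachable from `source`; states in `blocked` are not expanded."""
    seen, queue = {source}, deque([source])
    while queue:
        s = queue.popleft()
        if s in blocked:
            continue
        for t in succ.get(s, ()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def qual_reach(succ, init, target, attractor):
    """Almost-sure reachability check following the structure of Fig. 3.

    succ:      dict mapping a state to its successors (edges of non-zero probability)
    target:    set of states standing in for the configurations containing the label
    attractor: finite set of states standing in for the plain configurations
    """
    if init in target:                                        # line 1
        return True
    before_target = reachable(succ, init, blocked=target)     # role of P' (line 2)
    for g in attractor:                                       # line 3
        if g in before_target and not (reachable(succ, g) & target):   # line 4
            return False
    return True                                               # line 5
```

In a finite chain every state can serve as an attractor state, so passing the full state set as `attractor` gives the standard graph-based criterion: the target is reached almost surely iff no state reachable before the target has lost the ability to reach it.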


6.2 Almost-Sure Repeated Reachability 


For almost-sure repeated reachability we are interested in determining whether
the $\gamma_{\mathrm{init}}$-runs visit $l$ infinitely often with probability 1. The algorithm for this
is similar to the case for almost-sure reachability: we check whether there exists a
plain configuration $\gamma$ that satisfies $\gamma_{\mathrm{init}} \to^*_P \gamma \wedge \neg(\gamma \to^*_P l)$, in which case we
return false. The difference is that we do not need to transform the program as in
the case of almost-sure reachability. Details are in the supplementary material.


6.3 Almost-Never (Repeated) Reachability 


The almost-never variants of the (repeated) reachability problems,
NEVER_QUAL_REACH resp. NEVER_QUAL_REP_REACH, ask whether the
probabilities are equal to 0 rather than 1. The solution to NEVER_QUAL_REACH
is straightforward, since $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) = 0$ iff $\neg(\gamma_{\mathrm{init}} \xrightarrow{*}_P l)$. On the other
hand, the NEVER_QUAL_REP_REACH problem requires a search over B-plain
configurations $\gamma$ satisfying $\gamma_{\mathrm{init}} \xrightarrow{*}_P \gamma \xrightarrow{*}_P l$. Due to space constraints, we defer
the algorithm and proofs to the appendix.


Given: a program $P$, a configuration $\gamma_{\mathrm{init}} \in \Gamma_P$, a label $l \in \mathit{Lbl}_P$

NEVER_QUAL_REACH: Determine whether $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) = 0$

NEVER_QUAL_REP_REACH: Determine whether $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Box\Diamond l) = 0$


6.4 Decidability and Complexity 


The algorithms can be effectively implemented since (i) the set of B-plain
configurations is finite; and
(ii) the conditions of the for-loops and if-statements can be checked effectively,
as implied by Lemma 4. This gives Theorem 1. Theorem 2 is proved through
reductions from the reachability problem under the classical (non-probabilistic)
TSO semantics [19]. The non-primitive-recursive lower bounds follow from the
corresponding result for reachability of classical TSO.


Theorem 1. QUAL_REACH, QUAL_REP_REACH, NEVER_QUAL_REACH,
NEVER_QUAL_REP_REACH are all decidable.


Theorem 2. QUAL_REACH, QUAL_REP_REACH, NEVER_QUAL_REACH,
NEVER_QUAL_REP_REACH all have non-primitive-recursive complexities.


7 Quantitative (Repeated) Reachability 


In this section we discuss quantitative reachability problems for PTSO. In con- 
trast to qualitative analysis from Sec. 6, the task here is to compute the actual 
probability. We are not able to compute the probabilities exactly, but we can 
approximate the probability with an arbitrary degree of precision. 


7.1 Approximate Quantitative Reachability 


Given: program $P$, configuration $\gamma_{\mathrm{init}} \in \Gamma_P$, label $l \in \mathit{Lbl}_P$, precision value $\varepsilon \in \mathbb{R}^{>0}$

QUANT_REACH: Determine $\theta$ s.t. $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) \in [\theta, \theta + \varepsilon]$

QUANT_REP_REACH: Determine $\theta$ s.t. $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Box\Diamond l) \in [\theta, \theta + \varepsilon]$


In the approximate quantitative reachability problem, QUANT_REACH, given
a precision parameter $\varepsilon$, we are interested in determining an approximation $\theta$
satisfying $\theta \le \mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) \le \theta + \varepsilon$.

The algorithm in Fig. 4 solves the problem by successively improving the
approximation at each iteration until we are within $\varepsilon$-precision of the exact value.
The algorithm maintains two variables: PosApprx (positive approximation) is an
under-approximation of the probability with which $l$ is reachable from $\gamma_{\mathrm{init}}$, and
NegApprx (negative approximation) is an under-approximation of the probability
with which $l$ is not reachable from $\gamma_{\mathrm{init}}$. PosApprx serves as a lower bound on the
exact probability, while $1 - \mathrm{NegApprx}$ serves as an upper bound:
$\mathrm{PosApprx} \le \mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) \le 1 - \mathrm{NegApprx}$.


Algorithm: QUANT_REACH

Input: $P$: program; $\gamma_{\mathrm{init}} \in \Gamma_P$: configuration; $l \in \mathit{Lbl}_P$: label; $\varepsilon \in \mathbb{R}^{>0}$: precision.

1 Var
2   PosApprx, NegApprx $\in \mathbb{R}$: approximations, waiting $\in (\Gamma_P \times \mathbb{R})^*$: queue
3 PosApprx := 0; NegApprx := 0; waiting := $(\gamma_{\mathrm{init}}, 1)$
4 while PosApprx + NegApprx $< 1 - \varepsilon$ do
5   $(\gamma, \phi)$ := head(waiting); waiting := tail(waiting)
6   if $l \in \gamma$ then PosApprx := PosApprx + $\phi$
7   else if $\neg(\gamma \xrightarrow{*}_P l)$ then NegApprx := NegApprx + $\phi$
8   else
9     for each $\gamma'$ with $\gamma \rightarrow_P \gamma'$ do waiting := waiting $\cdot\ (\gamma', \phi \cdot M_P(\gamma, \gamma'))$
10 return PosApprx


Fig. 4. The quantitative reachability algorithm.

The algorithm iteratively improves these approximations until we reach a
point where their sum is within $\varepsilon$ from 1 (line 4). In such a case, the
value $\theta = \mathrm{PosApprx}$ is an $\varepsilon$-precise approximation.

To calculate the approximations, the algorithm performs forward reachability
analysis starting from the initial configuration $\gamma_{\mathrm{init}}$. It generates the set of
$\gamma_{\mathrm{init}}$-paths in a breadth-first manner, using the waiting FIFO queue. For each
generated path $\pi$ it also calculates the probability of $\pi$. Instead of the whole
path $\pi$, waiting only stores the last configuration, $\gamma$, of $\pi$ and the probability
of $\pi$, $\phi$, as a pair $(\gamma, \phi)$.

The approximation variables are initialized (line 3) to zero, and the waiting
queue is initialized to contain a single pair, $(\gamma_{\mathrm{init}}, 1)$, representing the initial
configuration $\gamma_{\mathrm{init}}$ (which occurs with probability one). The while-loop executes
until we achieve the desired precision. At each iteration, we check whether we
have already reached the desired precision. If not, the algorithm pops the pair
$(\gamma, \phi)$ from the waiting queue. There are three possibilities depending on $\gamma$:


1. If $l \in \gamma$ (if branch, line 6), the current path reaches $l$ and, consequently, we
   increment PosApprx by $\phi$, the weight of the current path.

2. If $l$ is not reachable from $\gamma$ (else-if branch, line 7), the measure of runs that
   reach $l$ starting from $\gamma$ is zero, and hence we increment NegApprx by $\phi$.

3. If neither of the above holds (else branch, lines 8-9), the current path needs to be
   explored further, so we enqueue all successors $\gamma'$ of $\gamma$ into the queue. The
   probability of the new path to $\gamma'$ is $\phi \cdot M_P(\gamma, \gamma')$.
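
The loop of Fig. 4 can be sketched on a small finite Markov chain as follows. This is only an illustration under assumptions: the chain is given explicitly as a dictionary `chain` mapping each state to its successor distribution (each row summing to 1), the set `targets` stands for the configurations containing the label, and the helper `can_reach` replaces the reachability check of Lemma 4 by a plain graph search. None of these names come from the paper.

```python
from collections import deque
from fractions import Fraction


def can_reach(chain, start, targets):
    """Graph search: is some target state reachable from `start`?"""
    seen, todo = {start}, deque([start])
    while todo:
        state = todo.popleft()
        if state in targets:
            return True
        for nxt, prob in chain[state].items():
            if prob > 0 and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return False


def quant_reach(chain, init, targets, eps):
    """Return theta with theta <= Prob(reach targets) <= theta + eps."""
    pos_apprx = Fraction(0)   # mass of explored paths that reached a target
    neg_apprx = Fraction(0)   # mass of explored paths that can no longer reach one
    waiting = deque([(init, Fraction(1))])
    while pos_apprx + neg_apprx < 1 - eps:
        state, phi = waiting.popleft()
        if state in targets:
            pos_apprx += phi
        elif not can_reach(chain, state, targets):
            neg_apprx += phi
        else:
            # extend the current path by one step, in breadth-first order
            for nxt, prob in chain[state].items():
                waiting.append((nxt, phi * prob))
    return pos_apprx
```

Exact rationals are used so that the loop condition is not affected by floating-point rounding; the queue can grow quickly on larger chains, which this sketch does not try to mitigate.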


To show correctness of the algorithm, let $\mathrm{PosApprx}^{(i)}$ and $\mathrm{NegApprx}^{(i)}$ represent
the value of PosApprx and NegApprx prior to performing the $i$-th iteration.
We show that in the limit as $i \to \infty$, the value of $\mathrm{PosApprx}^{(i)} + \mathrm{NegApprx}^{(i)}$ tends
to 1. Technically this follows by Lemma 5. By this lemma, any $\gamma_{\mathrm{init}}$-run almost
surely either (i) reaches a plain configuration from which $l$ is not reachable, or
(ii) repeatedly reaches a plain configuration from which $l$ is reachable. In case (ii)
it will almost surely reach $l$. This implies that $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond(l \vee \neg\exists\Diamond l)) = 1$,
i.e., a $\gamma_{\mathrm{init}}$-run will almost surely either reach $l$ or reach a configuration from
which $l$ is not reachable, implying that $\mathrm{PosApprx}^{(i)} + \mathrm{NegApprx}^{(i)}$ tends to 1.
Finally, by Lemma 4 we can effectively check the condition of the if-statement,
and hence the algorithm terminates.

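The correctness of the approximation on termination follows by the property
that $\mathrm{PosApprx}^{(i)}$ and $\mathrm{NegApprx}^{(i)}$ are under-approximations of the reach and
non-reach probabilities. This follows from the following invariants:


$\mathrm{PosApprx}^{(i)} \le \mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l)$ and $\mathrm{NegApprx}^{(i)} \le \mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond\neg\exists\Diamond l)$
$\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond l) \le 1 - \mathrm{Prob}_P(\gamma_{\mathrm{init}} \models \Diamond\neg\exists\Diamond l)$
$\mathrm{PosApprx}^{(i)} + \mathrm{NegApprx}^{(i)} \ge 1 - \varepsilon$ holds on termination


These imply that, on termination, PosApprx is within $\varepsilon$-precision of the exact probability.

Theorem 3. QUANT_REACH is solvable.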


7.2 Approximate Quantitative Repeated Reachability 


In the case of the approximate quantitative repeated reachability problem, we are
interested in approximating the probability of visiting a given label $l$ infinitely
often. We develop an algorithm that uses an iterative approximation scheme
similar to the reachability case. We defer full details of this algorithm to the
supplementary material and instead give an intuitive explanation of how it
differs from Sec. 7.1.

This algorithm too maintains approximations PosApprx and NegApprx and
iteratively narrows the error margin until it is smaller than $\varepsilon$. The main difference
is in the condition at line 6 of Fig. 4. In the case of reachability the lower
estimate, PosApprx, is increased when $l \in \gamma$. In the repeated reachability case,
this is not sufficient; we need to ensure that there is no state $\gamma'$ that is reachable
from the current state $\gamma$ and such that $l$ is not reachable from $\gamma'$. The existence
of such a $\gamma'$ implies existence of a non-zero measure continuation of the current
run in which $l$ is not reached infinitely often. Hence, the conditional of the
if-statement is modified to: $\forall \gamma' \in \mathrm{BPlain}.\ (\gamma \xrightarrow{*}_P \gamma') \Rightarrow (\gamma' \xrightarrow{*}_P l)$.

We note that naively we would have to check the above condition for all
configurations $\gamma' \in \Gamma_P$, which is infeasible since $\Gamma_P$ is an infinite set. We address
this by using Lem. 6, which shows that runs from all configurations eventually
reach a B-plain configuration. Hence it is sufficient to only check the condition for
the (finitely many) B-plain configurations, which are precomputed in BPlain.


Theorem 4. QUANT_REP_REACH is solvable. 


8 Expected Average Costs 


In this section, we develop a cost model for concurrent programs where we assign
a cost to the execution of each instruction, the goal being to approximate the
expected cost of runs that reach a given label.


8.1 Computing costs over runs


A cost function $\mathrm{Cost} : \mathit{Lbl}_P \to \mathbb{N}^{>0}$ for program $P$ defines for each label $l \in
\mathit{Lbl}_P$ the cost of executing the instruction at $l$. A particular way to define the
function is to assign a cost to each instruction in the programming language,
so that $\mathrm{Cost}(l)$ depends only on $\mathit{stmt}(l)$ and not on $l$ itself. But we consider
the general case. We extend Cost to runs as follows. Consider configurations
$\gamma = (\lambda, R, B, M)$ and $\gamma'$ such that $\gamma \rightarrow_P \gamma'$. If $\gamma \xrightarrow{p}_P \gamma'$, for process $p$, then we
define $\mathrm{Cost}(\gamma, \gamma') := \mathrm{Cost}(\lambda(p))$. In other words, it is the cost of the instruction
executed by $p$. Recall from Sec. 4 that $p$ is unique and therefore the function
is well-defined. If $\mathit{disab}(\gamma)$ or if $\neg(\gamma \rightarrow_P \gamma')$ then we define $\mathrm{Cost}(\gamma, \gamma') := 0$.
Consider a run $\rho \in \{\mathrm{Runs}(\gamma) \mid \rho \models_P \Diamond^{=i} l\}$, i.e. a $\gamma$-run that reaches $l$ for the
first time at step $i$. We define $\mathrm{Cost}(\rho)(l) := \sum_{0 \le j \le i-1} \mathrm{Cost}(\rho[j], \rho[j+1])$, i.e.,
the sum of costs of all executed instructions along $\rho$ up to the first visit to $l$.
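
To make the definition of the cost of a run concrete, the following sketch sums per-step costs along a finite run prefix up to the first configuration containing the target label. The representation is an assumption made for illustration only: a configuration is reduced to the tuple of current labels of the processes, and each step records which process index moves next (`None` for a step that executes no instruction and therefore costs 0).

```python
def run_cost(run, cost, target):
    """Cost of a run prefix up to its first visit to `target`.

    run    -- list of (labels, mover) pairs; `labels` is a tuple holding the
              current label of each process, `mover` is the index of the process
              that executes the next instruction (None if no instruction runs)
    cost   -- dict mapping each label to the cost of the instruction at it
    target -- the label whose first visit ends the summation
    Returns None if the prefix never reaches `target`.
    """
    total = 0
    for labels, mover in run:
        if target in labels:
            return total              # first visit: stop before charging further steps
        if mover is not None:
            total += cost[labels[mover]]
    return None
```

For instance, with two processes, a run that first executes the instruction at `a0` (cost 2) and then the one at `b0` (cost 3) before reaching `b1` has cost 5.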


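For a configuration $\gamma$, a label $l$, and a cost function Cost, we define a random
variable $X_{\gamma,l,\mathrm{Cost}} : \Omega \to \mathbb{R}$ over support $\Omega = \gamma \cdot \Gamma_P^{\omega}$ as follows:

$$X_{\gamma,l,\mathrm{Cost}}(\rho) = \begin{cases} 0 & \rho \notin \{\mathrm{Runs}(\gamma) \mid \rho \models_P \Diamond l\} \\ \mathrm{Cost}(\rho)(l) & \text{otherwise} \end{cases}$$


Given: program $P$, configuration $\gamma_{\mathrm{init}} \in \Gamma_P^{\mathrm{plain}}$, cost function $\mathrm{Cost} : \mathit{Lbl}_P \to \mathbb{N}^{>0}$,
label $l \in \mathit{Lbl}_P$ s.t. $\gamma_{\mathrm{init}} \models \exists\Diamond l$, precision $\varepsilon \in \mathbb{R}^{>0}$

EXP_AVE_COST: Determine $\theta$ s.t. $E(X_{\gamma_{\mathrm{init}},l,\mathrm{Cost}} \mid \gamma_{\mathrm{init}} \models \exists\Diamond l) \in [\theta, \theta + \varepsilon]$


The expected average cost problem. $E(X_{\gamma,l,\mathrm{Cost}})$ is defined as the expected
cost of reaching $l$ from $\gamma$ and $E(X_{\gamma,l,\mathrm{Cost}} \mid \gamma \models \exists\Diamond l)$ as the conditional
expectation over runs that reach $l$. If $\neg(\gamma \models_P \exists\Diamond l)$ then the expected
cost is not defined. If however $\gamma \models_P \exists\Diamond l$ then $E(X_{\gamma,l,\mathrm{Cost}} \mid \gamma \models \exists\Diamond l) =
E(X_{\gamma,l,\mathrm{Cost}}) / \mathrm{Prob}_P(\gamma \models_P \Diamond l)$, which follows since for the non-reaching runs,
the cost is zero. We present the expected average cost problem in the figure
above, where we want to approximate $E(X_{\gamma,l,\mathrm{Cost}} \mid \gamma \models \exists\Diamond l)$ to $\varepsilon$-precision.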


8.2 Eagerness 


Our solution to Exp_AVE_CosT relies on the fact that [P]"° satisfies an ea- 
gerness property in the sense of [17]. In our setting, eagerness means that the 
probability of avoiding the target label $l$ decreases exponentially with the number
of steps. Concretely, we show that there are two constants: the eagerness
degree $E_P \in \mathbb{R}^{>0}$, and the eagerness threshold $n_P \in \mathbb{R}^{>0}$ satisfying the following:

$$\forall \gamma \in \Gamma_P^{\mathrm{plain}}\ \forall l \in \mathit{Lbl}_P\ \forall n \ge n_P:\quad \gamma \models_P \exists\Diamond l \;\Rightarrow\; \mathrm{Prob}_P(\gamma \models_P \Diamond^{\ge n} l) \le (E_P)^n$$

i.e. for $n \ge n_P$, the probability of avoiding $l$ during the first $n$ steps decreases
exponentially with $n$. The following lemma forms the crux of this section.


Lemma 7 (Eagerness Lemma) $E_P$ and $n_P$ exist and are computable.

We devote this sub-section to an overview of the proof of Lemma 7
(the formal proof is provided in the supplementary material). We consider the
behavior of runs with respect to the small and large configurations, exploiting
the fact that the runs of the system tend to gravitate towards the small
configurations. However, here we use a property, called left-biasedness (defined in
Sec. 8.2), that is stronger than the left-orientedness property of Sec. 5.1.

To prove Lemma 7, we show that, for a small configuration $\gamma \in \Gamma_P^{\mathrm{small}}$, the
runs from $\gamma$ satisfy the following three properties with a high probability: (i)
they make their first return to $\Gamma_P^{\mathrm{small}}$ within a small number of steps, (ii) they
return to $\Gamma_P^{\mathrm{small}}$ multiple times, within a small number of steps, and (iii) if they
eventually reach $l$ then they will do that within a few steps. We collect these
results to obtain the proof of Lemma 7.


Gravity: First Return. We recall that buffer sizes can increase by at most one
during process transitions, and that any number of messages can be flushed to
the memory during an update transition (Sec. 4 and Sec. 5.1). Based on this, we
show left-biasedness, defined as follows:


Left-biasedness: $\forall \gamma \in \Gamma_P^{\mathrm{large}}$, the probability of moving from $\gamma$ to a
smaller configuration is bounded below by 2/3 and that of moving to a
larger configuration is bounded above by 1/3, regardless of $P$.


Using left-biasedness, we show that the set $\Gamma_P^{\mathrm{small}}$ has a gravity property,
namely, a run starting from a small configuration will, with a high probability,
return to the set $\Gamma_P^{\mathrm{small}}$ (for the first time) within a small number of steps. Formally,
we define the gravity parameter $G_P$ from the two left-biasedness bounds 2/3 and 1/3 as
$G_P := 2\sqrt{\tfrac{2}{3} \cdot \tfrac{1}{3}} = \tfrac{2\sqrt{2}}{3}$. We prove the following lemma.


Lemma 8 (Gravity Lemma) Probp (y Ep OF” rY) < (Gp)", for all 
ye Te" andallneN. 


The lemma states that, starting from a small configuration, the probability
that a run avoids $\Gamma_P^{\mathrm{small}}$ in the next $n$ steps decreases exponentially with $n$.


Multiple Revisits. Notice that the gravity lemma is concerned with the first
return to the set of small configurations. We will now apply this argument
repeatedly to conclude that, with high probability, multiple re-visits to small
configurations take place “quickly”. That is, the set of runs starting from $\Gamma_P^{\mathrm{small}}$ and
frequently re-visiting $\Gamma_P^{\mathrm{small}}$ has a high measure. To formalize these arguments,
we make the following definition. For $m, n$ with $1 \le m \le n$, we define $\mathrm{Visit}_P(n, m)$
to be the set of runs that visit the set $\Gamma_P^{\mathrm{small}}$ exactly $m$ times in their first $n - 1$
steps$^{10}$. We use the Visit predicate to partition the set of $\gamma$-runs, depending on
how often they return to $\Gamma_P^{\mathrm{small}}$ during their first $n$ steps. We distinguish these as

10 For technical convenience, we use n — 1 instead of n in the definition of Visit. This 
allows us to avoid some corner cases in the proofs. 


338 P. A. Abdulla et al. 


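Sporadic-Runs (S-Runs): runs that visit $\Gamma_P^{\mathrm{small}}$ sporadically during their first
$n$ steps, and Frequent-Runs (F-Runs): runs that visit $\Gamma_P^{\mathrm{small}}$ frequently during
their first $n$ steps. We will derive a constant $v \in \mathbb{N}$ (see below) that delineates
the border between these sets. We formally define:

$$\mathrm{SRuns}(\gamma)(n) := \bigcup_{1 \le m \le \lfloor n/v \rfloor} \{\rho \in \mathrm{Runs}(\gamma) \mid \rho \models \mathrm{Visit}_P(n, m)\}$$

$$\mathrm{FRuns}(\gamma)(n) := \bigcup_{\lfloor n/v \rfloor + 1 \le m \le n} \{\rho \in \mathrm{Runs}(\gamma) \mid \rho \models \mathrm{Visit}_P(n, m)\}$$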


[Figure: timeline of steps 0, 1, 2, ..., n showing one example configuration sequence each for an S-run, an F-run and a D-run.]

Fig. 5. Figure depicting configuration sequences of S, F and D runs. Green dots
represent small configurations, blue dots represent large configurations. All runs start
in a small (plain) configuration. Within the first $n$ configurations: the S-run visits $\Gamma_P^{\mathrm{small}}$
at most $\lfloor n/v \rfloor$ times, the F and D runs visit $\Gamma_P^{\mathrm{small}}$ at least $\lfloor n/v \rfloor + 1$ times. A D-run is a special
case of an F-run which does not visit label $l$ (red dot) in the first $n$ steps.


The value of $n/v$ distinguishes the S-Runs from the F-Runs. Our goal is to
give an upper bound on the measure of the S-Runs. For a prefix path $\pi$ of length
$n$, there are $\binom{n-1}{m-1}$ ways to choose the $m - 1$ indices along $\pi$ at which $\Gamma_P^{\mathrm{small}}$ is
reached (since the run starts from $\Gamma_P^{\mathrm{small}}$). Each of the $m - 1$ path fragments
between these indices represents one consecutive revisit of $\Gamma_P^{\mathrm{small}}$. By Lemma 8,
the measure of the set of such runs is bounded by $(G_P)^{n-m} = \left(\tfrac{2\sqrt{2}}{3}\right)^{n-m}$, giving


Probp (SRuns (7) (n)) < Shed, (mt) -gB-™ < ( §. (54): (+ v3.1) 


under the condition that 4 < 2-v < n. The second inequality is obtained 


through algebraic manipulations using Gp = 2v2, Define f(x) := $ . (=) . 


1 
(2+ v3-2)'*!. We have f(150) = 0.986 < 1. Hence, for parameter v := 150, 
defining $E_P^S := f(v)$, we have the following lemma, where the bound decays
exponentially with $n$ since $E_P^S < 1$.


Lemma 9 (S-Run Bound) $\mathrm{Prob}_P(\gamma \models_P \mathrm{SRuns}(\gamma)(n)) \le (E_P^S)^n$, for all $\gamma \in
\Gamma_P^{\mathrm{small}}$ and all $n$ such that $300 = 2 \cdot v \le n$.


Reaching the label | We now turn our attention to the set of F-Runs. Our 
goal is to show that if an F-Run reaches | then, with a high probability, it will 
reach | “quickly”. To that end, we consider the opposite scenario and introduce 
a subset of the F-Runs which we call Delayed Runs (D-Runs): 


DRuns (7) (I) (n) := Um=|" 41 {p € Runs (7) | pEp &™lA Visitp (n,m)} 
A D-Run is an F-Run that delays its first visit to the label $l$ until the $n$-th step
for some $n$. We show that the measure of D-Runs decreases as $n$ increases. Note
that $l$ is reachable from all configurations on a path that ends at $l$. Therefore,
we consider the set $A := \{\gamma \in \Gamma_P^{\mathrm{small}} \mid \gamma \models_P \exists\Diamond l\}$ of small configurations from
which $l$ is reachable. We analyze how often a run, starting from a small
configuration, visits $A$ before finally visiting the label $l$. For sets of configurations
$G_1, G_2 \subseteq \Gamma_P$, a run $\rho$, and $m \in \mathbb{N}$, we write $\rho \models G_1\ \mathrm{Before}^m\ G_2$ to denote that
$\rho$ visits the set $G_1$ at least $m$ times before visiting $G_2$ for the first time. Notice

$$\mathrm{DRuns}(\gamma)(l)(n) \subseteq \bigcup_{m = \lfloor n/v \rfloor + 1}^{n} \{\rho \in \mathrm{Runs}(\gamma) \mid \rho \models_P A\ \mathrm{Before}^m\ l\} \qquad (2)$$

To upper bound the measure of D-Runs, we start by upper bounding the mea- 
sure of the set {p € Runs (y) | p Ep ABefore” |}, i.e. y-runs making m visits 
to A before visiting |. We consider the probability that a run from a small con- 
figuration y does visit | before returning to y. We can compute a u such that 


0<u< min Probp (y = O(\Before! 7)) (3) 
VE 


Hence $\mu$ is a lower bound on the measure of runs that start from some
configuration $\gamma \in A$ and visit $l$ before returning to $\gamma$. To obtain an upper bound on
the measure of D-Runs, we show the following inequality:


n 


Prop (DRuns (7) (I)(n)) << È whl eH L p=) 


m=[Z]+1 VEA T a-p) (1-0-p) AT 


The first inequality follows from formulas 2 and 3, while the second is obtained 
1 
through algebraic techniques. Define E} such that (l—y)“14l < E} < 1. Such an 


1 
ER is computable since v, A, u are computable. Since (1— 1) 714 < E} it follows 
that there is a natural number, denoted by 73, such that IAI i 


(1-p)-=(1=4) 47) 
1 n 
(a = pra) < (E2)” for all n > np. This gives the following lemma. 


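Lemma 10 (D-Run Bound) $\mathrm{Prob}_P(\mathrm{DRuns}(\gamma)(l)(n)) \le (E_P^D)^n$, for all $\gamma \in
\Gamma_P^{\mathrm{small}}$ and all $n \ge n_P^D$.


Proof of Lemma 7. We now give a sketch of the proof of the eagerness property.
Choose a value $E'_P$ such that $\max(E_P^S, E_P^D) < E'_P < 1$. From Lemma 9
and Lemma 10 it follows that for some constant $n'_P \ge \max(n_P^D, 300)$,
$\mathrm{Prob}_P(\gamma \models_P \Diamond^{=n} l) \le (E'_P)^n$, for all $n \ge n'_P$ (sufficiently large). The final step is
to extend the argument to the set of $\gamma$-runs that reach $l$ in $n$ or more steps (as
required by Lemma 7).

$$\mathrm{Prob}_P(\gamma \models_P \Diamond^{\ge n} l) = \sum_{i=n}^{\infty} \mathrm{Prob}_P(\gamma \models_P \Diamond^{=i} l) \le \sum_{i=n}^{\infty} (E'_P)^i = \frac{(E'_P)^n}{1 - E'_P}$$

Choose $E_P$ (exists since $E'_P < 1$) such that $E'_P < E_P < 1$. There exists an $n_P$
such that $\frac{(E'_P)^n}{1 - E'_P} \le (E_P)^n$ for all $n \ge n_P$, and hence $\mathrm{Prob}_P(\gamma \models_P \Diamond^{\ge n} l) \le (E_P)^n$
for all $n \ge n_P$ (sufficiently large). This gives us the result.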
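
The last step of the proof sketch rests on the closed form of a geometric tail and on the fact that a slightly larger base eventually absorbs the constant factor. The snippet below checks this numerically for sample values; it is only a sanity check under assumptions: the function names are hypothetical, `e_prime` and `e` are placeholders for the two bases with 0 < e_prime < e < 1, and the values 0.5 and 0.75 in the usage note are arbitrary and not the constants of the paper.

```python
import math


def tail_bound(e_prime, n):
    """Closed form of sum_{i >= n} e_prime**i for 0 < e_prime < 1."""
    return e_prime ** n / (1 - e_prime)


def threshold(e_prime, e):
    """Smallest n with e_prime**n / (1 - e_prime) <= e**n, for 0 < e_prime < e < 1.

    The inequality is equivalent to (e_prime / e)**n <= 1 - e_prime, i.e.
    n >= log(1 - e_prime) / log(e_prime / e), both logarithms being negative.
    """
    return max(0, math.ceil(math.log(1 - e_prime) / math.log(e_prime / e)))
```

With `e_prime = 0.5` and `e = 0.75`, `threshold` yields 2: the tail bound is 0.5 at n = 2, which is below 0.75 squared, while at n = 1 the tail bound equals 1 and exceeds 0.75.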
8.3 The Algorithm


Now we proceed to describe the algorithm. The goal is to approximate
$E(X_{\gamma_{\mathrm{init}},l,\mathrm{Cost}} \mid \gamma_{\mathrm{init}} \models \exists\Diamond l)$. The scheme followed by the algorithm is similar
to the quantitative section: it iteratively improves an approximation until it is
$\varepsilon$-precise. However, the implementation is much more challenging since we need
to maintain error margins on both the cost and the probabilities. It performs
forward reachability analysis, starting from $\gamma_{\mathrm{init}}$ and generating successively
longer $\gamma_{\mathrm{init}}$-paths in a breadth-first manner.

The variable waiting contains triples of the form $(\gamma, \psi, \phi)$ corresponding to
$\gamma_{\mathrm{init}}$-paths waiting to be analysed. For such a path $\pi$, $\gamma$ is the last configuration of $\pi$,
$\psi$ is the cost of $\pi$, and $\phi$ is the probability of taking $\pi$. We initialize waiting to
contain a triple corresponding to the empty path from $\gamma_{\mathrm{init}}$: $(\gamma_{\mathrm{init}}, 0, 1)$. Prior to
the $i$-th iteration of the loop (line 10), waiting contains triples corresponding to paths
of length $i$. At each loop iteration the triples in waiting are analysed and the
triples for paths one step deeper are generated for the next iteration.


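Algorithm: Solving EXP_AVE_COST


Input: $P$: program; $\gamma_{\mathrm{init}} \in \Gamma_P$: configuration; $l \in \mathit{Lbl}_P$: label with $\gamma_{\mathrm{init}} \models \exists\Diamond l$;
Cost: $\mathit{Instr}_P \to \mathbb{R}$: cost function; $\varepsilon \in \mathbb{R}^{>0}$: precision;


1 Var
2   waiting, waiting' $\in (\Gamma_P \times \mathbb{R} \times \mathbb{R})^*$: queues;
3   CostApprx $\in \mathbb{R}$: approximation of $E(X_{\gamma,l,\mathrm{Cost}})$;
4   ProbApprx $\in \mathbb{R}$: under-approximation of $\mathrm{Prob}_P(\gamma \models_P \Diamond l)$;
5   CostError $\in \mathbb{R}$, ProbError $\in \mathbb{R}$: over-approximations of errors;
6   $k, n \in \mathbb{N}$;
7 k := MaxCost(Cost); n := 0;
8 CostApprx := 0; ProbApprx := 0; waiting := $(\gamma_{\mathrm{init}}, 0, 1)$;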
9 CostError : = wep ProbError := =e 7 
10 repeat
11   n := n + 1; waiting' := $\emptyset$;
12   for i = 1 to |waiting| do
13     $(\gamma, \psi, \phi)$ := waiting[i];
14     if $l \in \gamma$ then
15       CostApprx := CostApprx + $\psi \cdot \phi$; ProbApprx := ProbApprx + $\phi$;
16     else
17       for all $\gamma'$ : $\gamma \rightarrow_P \gamma'$ do
18         waiting' := waiting' $\cdot\ (\gamma', \psi + \mathrm{Cost}(\gamma, \gamma'), \phi \cdot M_P(\gamma, \gamma'))$;
19   CostError := CostError $\cdot\ E_P$; ProbError := ProbError $\cdot\ E_P$;
20   waiting := waiting';
21 until (Ser a EEr < €) A (ProbError > 0) A (n > np); 


22 return CostApprx / (ProbApprx + ProbError)


Fig. 6. The expected average cost algorithm. 
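
The breadth-first bookkeeping of the (configuration, cost, probability) triples can be sketched on a toy Markov chain as below. This is deliberately partial and rests on assumptions: it runs a fixed number of rounds instead of the termination test of line 21, it does not maintain CostError and ProbError, and it charges a per-state cost `step_cost[state]` for leaving a state rather than the cost of the instruction executed by the moving process. The names `chain`, `step_cost` and `rounds` are hypothetical.

```python
from fractions import Fraction


def exp_cost_iterate(chain, step_cost, init, targets, rounds):
    """Return (cost_apprx, prob_apprx) after `rounds` breadth-first rounds.

    cost_apprx accumulates cost * probability over explored paths that reached
    a target state; prob_apprx accumulates the probability of those paths.
    """
    waiting = [(init, 0, Fraction(1))]      # triples (state, psi, phi)
    cost_apprx = Fraction(0)
    prob_apprx = Fraction(0)
    for _ in range(rounds):
        next_waiting = []
        for state, psi, phi in waiting:
            if state in targets:
                cost_apprx += psi * phi
                prob_apprx += phi
            else:
                # extend the path by one step and charge the cost of that step
                for nxt, prob in chain[state].items():
                    next_waiting.append((nxt, psi + step_cost[state], phi * prob))
        waiting = next_waiting
    return cost_apprx, prob_apprx
```

On the chain where state 0 loops with probability 1/2 and moves to the target state 1 with probability 1/2, at unit cost per step, the ratio `cost_apprx / prob_apprx` approaches the expected cost 2 as the number of rounds grows.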


The iterations calculate increasingly precise approximations of
$E(X_{\gamma_{\mathrm{init}},l,\mathrm{Cost}})$ and of $\mathrm{Prob}_P(\gamma_{\mathrm{init}} \models_P \Diamond l)$, maintained in variables CostApprx
and ProbApprx, respectively. We maintain two additional variables (CostError
and ProbError) that help us to provide an upper bound on the estimation
errors. Defining $\mathrm{MaxCost}(\mathrm{Cost}) := \max\{\mathrm{Cost}(l) \mid l \in \mathit{Lbl}_P\}$, we explain the
correctness of the algorithm with a number of invariants.

Lemma 11 The algorithm maintains the following invariants, where invariants
(1,2,5,6) hold for all $i \ge 0$ and invariants (3,4) hold for all $i \ge n_P$.


1. CostApprx™ = 5 Cost (p) - Probp (p): 
{pERuns (Yini) | p= t1} 


2. ProbApprx”) = Probp (Yinit = (Sl): 
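3. $\mathrm{CostApprx}^{(i)} \le E(X_{\gamma,l,\mathrm{Cost}}) \le \mathrm{CostApprx}^{(i)} + \mathrm{CostError}^{(i)}$.
4. $\mathrm{ProbApprx}^{(i)} \le \mathrm{Prob}_P(\gamma \models_P \Diamond l) \le \mathrm{ProbApprx}^{(i)} + \mathrm{ProbError}^{(i)}$.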
5. CostError’) = MaxCost (Cost) - ay. 
6. ProbError“) = E. 

iE 


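Invariants 5 and 6 imply that as $i \to \infty$, $\mathrm{CostError}^{(i)}$ and $\mathrm{ProbError}^{(i)}$
tend to 0. Hence,

$$\lim_{i \to \infty} \left( \frac{\mathrm{CostApprx}^{(i)} + \mathrm{CostError}^{(i)}}{\mathrm{ProbApprx}^{(i)}} - \frac{\mathrm{CostApprx}^{(i)}}{\mathrm{ProbApprx}^{(i)} + \mathrm{ProbError}^{(i)}} \right) = 0,$$

implying termination. Since $n \ge n_P$ when the algorithm terminates, by invariants
3 and 4 it follows that $\mathrm{CostApprx}^{(n)} \le E(X_{\gamma,l,\mathrm{Cost}}) \le \mathrm{CostApprx}^{(n)} +
\mathrm{CostError}^{(n)}$ and $\mathrm{ProbApprx}^{(n)} \le \mathrm{Prob}_P(\gamma \models_P \Diamond l) \le \mathrm{ProbApprx}^{(n)} +
\mathrm{ProbError}^{(n)}$. Combining these two inequalities and the termination condition
of the algorithm, we get the following:

$$\frac{\mathrm{CostApprx}^{(n)}}{\mathrm{ProbApprx}^{(n)} + \mathrm{ProbError}^{(n)}} \le \frac{E(X_{\gamma,l,\mathrm{Cost}})}{\mathrm{Prob}_P(\gamma \models_P \Diamond l)} \le \frac{\mathrm{CostApprx}^{(n)}}{\mathrm{ProbApprx}^{(n)} + \mathrm{ProbError}^{(n)}} + \varepsilon$$

Hence on termination, $\theta := \frac{\mathrm{CostApprx}^{(n)}}{\mathrm{ProbApprx}^{(n)} + \mathrm{ProbError}^{(n)}}$ is within $\varepsilon$-precision of the
true value, implying correctness of the algorithm. We get the following theorem.


Theorem 5. The above algorithm solves EXP_AVE_COST.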


9 Conclusions, Discussions, and Perspectives 


We presented PTSO, a probabilistic extension of the classical TSO semantics.
We have shown decidability/computability results for a wide range of properties
such as quantitative and qualitative reachability/repeated reachability and
expected average costs. As far as we know, this is the first study of probabilistic
verification for weak memory models, and it opens many avenues for future work.


Refined Probability Distributions. For ease of presentation, we developed our
results in the context of specific scheduling and update policies. However, we
emphasize that our results carry over to policies satisfying faithfulness and
left-orientedness, which are fairly weak conditions. Hence we believe that developing
more refined models that better capture behaviours of TSO implementations,
using techniques such as parameter estimation, is interesting future work.
General Cost Models. The same can be said for cost models: our algorithm works
for all cost functions such that the cost of a path is exponentially bounded by
its length. In particular, developing cost models that closely mimic usage of
processor resources, e.g. cost based on read from local store-buffer vs. read from
memory, can be useful to gain a better understanding of the implementation.
Other Memory Models. Finally, we are interested in extending our approach to
other weak memory models such as RA/SRA, POWER, ARM.
Abstract. Substructural type systems are growing in popularity because
they allow for a resourceful interpretation of data which can be
used to rule out various software bugs. Indeed, substructurality is fi- 
nally taking hold in modern programming; Haskell now has linear types 
roughly based on Girard’s linear logic but integrated via graded function 
arrows, Clean has uniqueness types designed to ensure that values have 
at most a single reference to them, and Rust has an intricate ownership 
system for guaranteeing memory safety. But despite this broad range 
of resourceful type systems, there is comparatively little understanding 
of their relative strengths and weaknesses or whether their underlying 
frameworks can be unified. There is often confusion about whether lin- 
earity and uniqueness are essentially the same, or are instead ‘dual’ to 
one another, or somewhere in between. This paper formalises the re- 
lationship between these two well-studied but rarely contrasted ideas, 
building on two distinct bodies of literature, showing that it is possible 
and advantageous to have both linear and unique types in the same type 
system. We study the guarantees of the resulting system and provide 
a practical implementation in the graded modal setting of the Granule 
language, adding a third kind of modality alongside coeffect and effect 
modalities. We then demonstrate via a benchmark that our implementa- 
tion benefits from expected efficiency gains enabled by adding uniqueness 
to a language that already has a linear basis. 


Keywords: linear types - uniqueness types - substructural logic 


1 Introduction 


Linear types and uniqueness types are two influential and long- 
standing flavours of substructural type system. As these approaches have devel- 


oped, it has become clear in the community (both in folklore and the literature) 
that these are closely related ideas. For example, the chapter on substructurality 
in Advanced Topics in Types and Programming Languages describes unique- 
ness types as “a variant of linear types”. This framing is supported by various 
works which, for example, make reference to “a form of linearity (called unique- 
ness)” or other such statements of equality or similarity [88]. 

But reading a different set of papers gives a contrasting impression that
linearity and uniqueness are not the same but in some sense dual to one another,
and with different behaviour for at least some applications. Recent work on linear
types for Haskell [7] describes the two concepts as being “at their core, dual” and
later having a “weak duality”. The impression that these two approaches behave 
differently is backed up by much of the theoretical work on uniqueness types, 
with one paper stating that “although both linear logic and uniqueness typing 
are substructural logics, there are important differences” [56], closely followed 
by a tantalising mention of the fact that “some systems based on linear logic are 
much closer to uniqueness typing than to linear logic”. 

It is clear, at least, that both linear types and uniqueness types are substruc- 
tural type systems: they both restrict structural rules (in particular, contraction 
and weakening) of type systems that are the Curry-Howard counterparts to reg- 
ular intuitionistic logic. This captures the well-known maxim that “not all things 
in life are free” [61]; many kinds of data behave resourcefully, and are subject
to constraints on their usage. Sensitive data should not be infinitely duplicated 
and passed around freely, file handles should not be arbitrarily discarded without 
being properly closed, and communication channels should not be used without 
adherence to a fixed protocol, to name a few! 

Thanks to these clear benefits, notions of substructurality are slowly but 
surely making their way into the programming ecosystem, with languages such 
as Haskell [7], Idris (ol, Clean [47], Rust [24], ATS [65], and Granule all
having type systems that behave substructurally in some way. What is not clear,
however, is what exactly the relationship is between these varying systems; for
instance, it is not obvious how to relate linearity and uniqueness. Linear types,
though they themselves come in various forms, are most often based on the linear
logic of Girard [15], and in the strictest sense they treat values as resources which
must be used exactly once and never again. On the other hand, uniqueness types
are named as such because they aim to ensure that values are guaranteed to have
at most one reference to them [40][45][47][48][55][56], with a view towards allowing
them to be safely updated in-place. Do these two requirements always coincide, 
or are there cases where they diverge? 

In this paper, we resolve this long-standing confusion, building on two distinct 
bodies of literature to develop an accurate understanding of the contexts in which 
linear and uniqueness types behave the same or behave differently, and their 
relative strengths and weaknesses. Our primary contributions are as follows: 


- In Section 2 we discuss the contrasting understandings of the relationship
  between linearity and uniqueness, and draw together aspects of these
  viewpoints to intuitively describe the link between these concepts.

- In Section 3 we formalise these notions by developing a unified calculus and
  type system that incorporates linear, unique, and Cartesian (unrestricted or
  non-unique/non-linear) types all at once, building on the linear λ-calculus.
  We give an operational model via a heap semantics which allows us to prove
  various key operational guarantees for both linearity and uniqueness.

- In Section 4, as a proof of concept, we implement uniqueness types into
  the language Granule, which already has a linear basis, introducing a third
  flavour of modality alongside the graded comonadic (coeffectful) and graded
  monadic (effectful) modalities already present in the language. The
  implementation enables the classic primary use of uniqueness: access to safe
  in-place update in a functional language without working inside a monad.

- In Section 4.2 we confirm the performance benefits of uniqueness types by
  benchmarking the performance of arrays which allow for in-place update.
  We generate impure Haskell code from our Granule implementation in order
  to demonstrate that further efficiency can be gained via adding uniqueness
  types even when your language is already linear at its core.


Section 5 and Section 6 provide related work and discussion, including relation
to ideas in Rust. Various additional details are collected in the appendix [28],
including proofs and collected reduction rules for the operational semantics. We
also provide an artifact [29], so that the interested reader can experiment with
code examples in Granule and reproduce our benchmarks for themselves.


2 Key Ideas 


It is clear that linear and uniqueness types both involve restricting the sub- 
structural rules of intuitionistic logic, but what remains unclear is the exact 
relationship between the two concepts. This section discusses two widespread 
understandings of their relationship, both of which are accurate in some respects 
but fail to capture some key similarities and differences. We then combine aspects 
of both viewpoints to systematically relate linearity and uniqueness. 


2.1 Are linearity and uniqueness (essentially) the same? 


Perhaps the most well-known substructural types are linear types, which have 
been studied for decades in the literature as the Curry-Howard counterpart 
of linear logic [15]. Several languages have implemented linear type systems 
over the years, including ATS [65], Alms and Quill [82], and they are steadily 
making their way into the mainstream via extensions to languages like Haskell [7]. 
Examples of linearity in this paper will focus on the functional language Gran- 
ule (whose syntax resembles Haskell), since values in Granule are linear by 
default making the examples less complex, and also because Granule will later 
be the foundation upon which we build our unified calculus. 

Strictly, linear types treat values as resources which must be used once and 
then never again. For instance, we can type the identity function, since it binds a 
single variable and then uses it, but the K combinator (which discards one of its 
arguments) is not linearly typed. Thus linearity is a claim about the consump- 
tion of a resource: a linear type is a contract, which says that we must consume a 
value that we are given exactly once. Consider the following classic example (ren- 
dered in Granule) of a function which cannot be represented using linear types, 
assuming an interface where eat : Cake → Happy and have : Cake → Cake: 


    impossible : Cake → (Happy, Cake) 
    impossible cake = (eat cake, have cake) 


Ill-typed Granule 
Note that Granule’s function type → is the type of linear functions, more 
traditionally written ⊸. The above function is ill-typed and the Granule compiler 
will brand it with a linearity error; this is because the value of type Cake passed 
into the function is a linear resource, and the body of the function requires us 
to duplicate it (via contraction), which is forbidden. Thus, linear types remind 
us of the familiar aphorism: you can’t have your cake and eat it too. 
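
The no-duplication half of this discipline can be sketched in Rust, whose ownership rules reject a second use of a moved value. This is only an analogy under stated assumptions: Rust's ownership is affine rather than linear (a value may be silently dropped), and the names `Cake`, `Happy`, `eat` and `have` simply mirror the hypothetical interface above.

```rust
// A hedged Rust analogue of the cake example. Rust ownership is affine rather
// than linear: it forbids using a moved value twice but allows dropping it.
#[derive(Debug, PartialEq)]
struct Cake;

#[derive(Debug, PartialEq)]
struct Happy;

// Consumes the cake: the caller can no longer use it afterwards.
fn eat(_cake: Cake) -> Happy {
    Happy
}

// Returns the same cake, passing ownership back to the caller.
fn have(cake: Cake) -> Cake {
    cake
}

fn main() {
    let cake = Cake;
    let cake = have(cake); // ownership moves in and comes back out
    let happy = eat(cake); // ownership moves in for good
    // let again = have(cake); // rejected by the compiler: use of moved value
    println!("{:?}", happy);
}
```

Uncommenting the marked line makes the program fail to compile, which plays the role of the linearity error reported by Granule.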


Uniqueness types, on the other hand, are primarily aimed at ensuring that 
values have only a single reference to them, which is a useful property for ensuring 
the safety of updating data in-place. But is this uniqueness restriction really so 
different from the constraints of linearity? 


One of the most familiar languages featuring uniqueness types is Clean [47], 
which uses uniqueness for mutable state and input/output, in contrast to lan- 
guages such as Haskell which use monads for similar purposes. We shall use 
Clean for our uniqueness examples for the moment, before we introduce our own 
implementation of uniqueness in Section 4. Consider the following in Clean: 


    impossible :: *Coffee -> (*Awake, *Coffee) 
    impossible coffee = (drink coffee, keep coffee) 


Ill-typed Clean 


We use coffee instead of cake to distinguish unique values from the linear values 
of the Granule example, but notice this function has exactly the same structure 
as the previous example. The operator * denotes a unique type, since unre- 
stricted values are the default in Clean. Similarly to Granule, when presented 
with this function Clean gives a uniqueness error; the argument of type *Coffee 
is duplicated, and so we can no longer guarantee there is only one reference to 
it upon exiting the function. Think of a *Coffee as having been freshly poured; 
we cannot continue acting as though it is fresh once some of it has been drunk! 


So far, it seems that the concepts of linearity and uniqueness are very similar 
after all, as is often claimed. However, neither of these examples uses unrestricted 
values; we only see values that are linearly typed or uniquely typed. In fact, in a 
setting where all values must be linear, we can also guarantee that every value 
is unique, and vice versa! Intuitively, if it is never possible to duplicate a value, 
then it will never be possible for said value to have multiple references. It is 
when we also have the ability for unrestricted use (non-linear/non-unique) that 
differences between linearity and uniqueness begin to arise, as we will soon see. 


Much of the classic literature on linear types makes mention of the idea that 
linearity can be used for tracking whether a value has only one reference, though 
we know by now that this more accurately describes uniqueness; indeed, one of 
the oldest such papers by Wadler, which has been (rightly) hugely influential, 
states that “values of linear type have exactly one reference to them, and so 
require no garbage collection” p.2]. However, systems akin to the one dis- 
cussed by Wadler crucially separate values into two completely distinct 
and (mostly) disconnected linear and non-linear worlds. In this context, a linear 
value can never have been duplicated previously and thus must also obey the 
conditions required for uniqueness. Therefore, it is correct to say that a value of 
linear type has exactly one reference in such a system. 


This issue is further discussed in a later article by Wadler [8], though unique- 
ness types had yet to be invented and so the concept is never referred to by this 
name. Linear types based on linear logic are defined in Section 3 of said article, 
for which linearity behaves as we understand it: a value having linear type guar- 
antees that it will not be duplicated or discarded, but the notion of dereliction 
allows a non-linear variable to be used linearly going forwards [15]. As Wadler 
states, “dereliction means we cannot guarantee a priori that a variable of linear 
type has exactly one pointer to it” p.7], and so we cannot guarantee unique- 
ness of reference in a system based upon linear logic. In Section 7, Wadler goes 
on to define steadfast types, where dereliction and promotion are again restricted 
to recover the uniqueness guarantee in addition to the linearity restriction.³ 

However, never being able to duplicate or discard any value is an overly 
restrictive view of data, preventing many valid uses of various kinds of informa- 
tion, and so modern languages with linear types therefore generally do provide a 
mechanism for non-linearity rather than working in the ‘steadfast’ style. Linear 
logic provides the ! modality (also called the exponential modality), which allows 
the representation of non-linear (unrestricted) values. In Granule, we can use this 
modality to rewrite the previous example into one that is now well-typed: 


    possible : !Cake → (Happy, Cake) 
    possible lots = let !cake = lots in (eat cake, have cake) 


Granule 


We can think of !Cake values as representing an infinite amount of cake, which 
is made available once we eliminate the modality (via the let) to get an unre- 
stricted (non-linear) variable cake. The functions eat and have are linear func- 
tions, so each application in isolation views cake linearly, by an implicit use of 
dereliction in the type system. Crucially, from an unrestricted value we can pro- 
duce a linear value, so we can impose the restriction of linearity whenever we 
like. However it is not possible to produce an unrestricted value from a linear 
one. This restriction means that linear types are useful for representing resources 
such as file handles, as in the following example: 


    twoChars : (Char, Char) <IO> 
    twoChars = let  -- do-notation like syntax 
        h ← openHandle ReadMode "someFile"; 
        (h, c1) ← readChar h; 
        (h, c2) ← readChar h; 
        () ← closeHandle h 
      in pure (c1, c2) 


Granule 


3 The concept of steadfastness coincides with the notion of “necessarily unique” used 
in languages such as Clean, where a necessarily unique value is one that is unique 
and also can never be made non-unique [47]. 
Here, we open a file handle, read two characters from it, and then close it. The 
linearity of the handle ensures that once we have created it, we must close it 
properly, and also that we cannot duplicate it along the way. But linearity is 
less useful in other circumstances. As an example, consider the case of mutable 
arrays. Discarding an array will not cause any problems,⁴ so a linear array would 
be too restrictive and not allow for some valid use cases; affine types allow 
discarding behaviour by adding back in weakening [52]. But in order to be able 
to mutate an array we need to be able to guarantee that no other references 
to it exist, and in this sense linearity is not strong enough; any linear value 
could have previously been a non-linear one that was duplicated any number of 
times before being specialised (via dereliction) to a linear type. For representing 
mutable arrays, we are better served by considering uniqueness types. 
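
To make the handle protocol concrete, here is a minimal Rust sketch of the same open/read/read/close sequence. Everything in it is an assumption for illustration: `Handle` is a hypothetical in-memory stand-in rather than a real file API, and `open_handle`, `read_char` and `close_handle` only imitate the shape of the Granule primitives. Each operation consumes the handle and, where appropriate, returns a fresh one, so a closed handle cannot be read again.

```rust
// Hypothetical in-memory "file handle"; not a real file API.
struct Handle {
    contents: Vec<char>,
    pos: usize,
}

fn open_handle(contents: &str) -> Handle {
    Handle { contents: contents.chars().collect(), pos: 0 }
}

// Reading consumes the handle and returns a fresh one, as in the Granule code.
fn read_char(mut h: Handle) -> (Handle, Option<char>) {
    let c = h.contents.get(h.pos).copied();
    if c.is_some() {
        h.pos += 1;
    }
    (h, c)
}

// Closing consumes the handle for good; no further reads can type-check.
fn close_handle(_h: Handle) {}

fn two_chars(contents: &str) -> (Option<char>, Option<char>) {
    let h = open_handle(contents);
    let (h, c1) = read_char(h);
    let (h, c2) = read_char(h);
    close_handle(h);
    (c1, c2)
}

fn main() {
    println!("{:?}", two_chars("someFile contents"));
}
```

Because Rust is affine, it would also accept a version of `two_chars` that forgets to call `close_handle`; ruling that out is exactly what a linear type adds.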


Uniqueness behaves differently to linearity in the context of a system with 
the ability to describe unrestricted values. If we have an unrestricted value, we 
certainly cannot produce a unique one from it which would violate the guarantee 
of uniqueness; we cannot claim that a value has only one reference to it when 
it could have been duplicated and manipulated elsewhere. But conversely, if we 
have a unique value, there is no harm in dropping this guarantee and producing 
an unrestricted value; a non-unique value does not need to make any promises 
about how many references may exist. Thus, in Clean we can write: 


    possible :: *Coffee -> (Awake, Coffee) 
    possible coffee = (drink coffee, keep coffee) 


Clean 


Here, we require that the input is unique (it has type *Coffee), so for the function 
to be well-typed we can no longer claim this value is unique once it reaches 
the output, as it has been duplicated along the way (and it now must have 
type Coffee). The information here is flowing in the opposite direction than 
for linearity; the possible function in Clean would be ill-typed if we replaced 
unique values with linear ones, and vice versa for the earlier Granule example. 
This directionality allows us to represent mutable arrays much more easily with 
uniqueness. For example, the following destructively fills a real-valued array: 


    fill :: *{Real} Int -> *{Real} 
    fill a1 0 = a1 
    fill a1 i 
      # f = toReal i 
      # a2 = {a1 & [i - 1] = f}   // write f to index i-1 
      = fill a2 (i - 1)           // recurse with unique array 


Clean 


4 One may worry that discarding an array could cause space leaks, but this can be 
tempered via garbage collection. If a non-linear value will no longer be used we know 
statically that it can be garbage collected, and thus it is harmless to reuse the space 
occupied by this value going forwards. This will allow us to update unique objects 
such as arrays destructively without being concerned about referential transparency. 
Here, we take in a unique array of floating point numbers and some unrestricted 
integer value, and fill the first cells of the array with the numbers up to that 
value. Here we know that it is safe to write to the array because it is unique, so 
no other references to it can exist elsewhere; once we are finished with the array 
later on, however, it is fine to discard it, as with an array in most other functional 
programming languages, which would not be possible if our array was linearly 
typed. This however does mean that uniqueness types are not appropriate for 
the earlier example of file handles—we cannot ensure that a unique file handle 
is closed, as it can be discarded at any time. 
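
The same destructive fill can be sketched in Rust, using ownership of a `Vec<f64>` as a stand-in for the unique array type. This is an analogy rather than a translation of Clean's semantics: taking the vector by value means no other reference to the buffer can exist, so the write is an in-place update. The sketch assumes `i` does not exceed the array length; an out-of-bounds index would panic.

```rust
// Sketch of Clean's `fill` using Rust ownership as a stand-in for uniqueness.
// Taking `Vec<f64>` by value means no other reference to the buffer exists,
// so the write below is an in-place update.
fn fill(mut a: Vec<f64>, i: usize) -> Vec<f64> {
    if i == 0 {
        return a;
    }
    a[i - 1] = i as f64; // write i (as a float) to index i-1
    fill(a, i - 1) // recurse, passing ownership along
}

fn main() {
    let a = fill(vec![0.0; 5], 3);
    println!("{:?}", a);
    // `a` may simply go out of scope here: discarding it is allowed.
}
```

As in the Clean version, the array can be dropped once we are finished with it, which a linearly typed array would not permit.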

In summary, linearity and uniqueness provide the same guarantees up until 
a system also has a notion of unrestricted value (non-linear or non-unique). The 
complementary but distinct use cases shown above make it clear that it would be 
valuable to have both linear and unique values together in a single language, but 
this has previously not been possible. Our main contribution is a core calculus 
that allows linearity and uniqueness to coexist and interact, demonstrated also 
via an implementation in the Granule language. Next we consider the question 
of duality, and how to formally describe how linearity and uniqueness differ. 


2.2 Are linearity and uniqueness dual? 


It is common in folklore and in the literature to describe linearity and uniqueness 
as somehow dual to one another (see e.g., [32,52]) but rigorous versions of this 
statement are more rarely found. The earliest formalisation of uniqueness is from 
Harrington’s ‘uniqueness logic’ [20], which we use as a foundation for much of the 
following. Harrington constructs a logic which is on the surface much like linear 
logic, but instead of the ! modality for non-linearity it includes a ◦ modality for 
non-uniqueness which differs from non-linearity in its introduction rule. 

In linear logic, the ! modality acts as a comonad, such that the introduction 
of ! on the right of a sequent means that all formulae on the left of a sequent 
must also have ! applied, whilst introduction of ! on the left is unrestricted: 


\[
\frac{{!}\Gamma \vdash P}{{!}\Gamma \vdash {!}P}\;{!}R
\qquad
\frac{\Gamma, P \vdash Q}{\Gamma, {!}P \vdash Q}\;{!}L
\]


(also known as storage and dereliction respectively [16]). In contrast, the 
non-uniqueness modality ◦ of Harrington acts as a monad, meaning that introduction 
of ◦ on the right is unrestricted but introduction of ◦ on the left of a sequent 
means that all formulae on the right of the sequent must also have ◦ applied: 


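\[
\frac{\Gamma \vdash P}{\Gamma \vdash P^{\circ}}\;{\circ}R
\qquad
\frac{\Gamma, P \vdash Q^{\circ}}{\Gamma, P^{\circ} \vdash Q^{\circ}}\;{\circ}L
\]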


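The non-uniqueness modality ◦ then has the following structural rules for 
contraction and weakening, which are conspicuously identical to those for the ! 
modality representing non-linearity: 

\[
\frac{\Gamma, P^{\circ}, P^{\circ} \vdash R}{\Gamma, P^{\circ} \vdash R}\;{\circ}C
\qquad
\frac{\Gamma \vdash R}{\Gamma, P^{\circ} \vdash R}\;{\circ}W
\qquad
\frac{\Gamma, {!}P, {!}P \vdash R}{\Gamma, {!}P \vdash R}\;{!}C
\qquad
\frac{\Gamma \vdash R}{\Gamma, {!}P \vdash R}\;{!}W
\]
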
One might be tempted to think that because the introduction rules for ◦ behave 
dually to those of !, the modalities are simply dual to one another, and thus 
non-uniqueness is equivalent to linear logic’s ?. But since the contraction and 
weakening rules for ◦ are the same as those for !, this is not quite the case; ◦ 
behaves dually to ! in some ways but not in others. Formally, ◦ is a monad while 
! is a comonad, but both are comonoidal, whereas ? is monoidal. 

Linear logic allows us to derive !P ⊢ P (from dereliction), which agrees with 
our notion of linearity where non-linear values can be restricted to behave linearly 
going forwards but if we have a linear value it must remain linear; uniqueness 
logic conversely allows us to derive P ⊢ P^◦, formalising our concept of uniqueness 
where we can forget the uniqueness guarantee and turn a unique value into a 
non-unique one, but if we have a non-unique value we cannot go back. 
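
Both derivations take a single step. As a sketch, assuming the usual identity axiom P ⊢ P (which is not among the rules displayed above) and an empty Γ, the left rule for ! and the right rule for ◦ give:

```latex
\[
\frac{P \vdash P}{{!}P \vdash P}\;{!}L
\qquad\qquad
\frac{P \vdash P}{P \vdash P^{\circ}}\;{\circ}R
\]
```

Neither converse is derivable in the same way: introducing ! on the right would require the hypothesis to already be under !, and introducing ◦ on the left would require the conclusion to already be under ◦.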

We can now make more precise the intuitive notion we have developed, which 
suggests that linear types provide a restriction on what can be done with a value 
‘in the future’ whilst uniqueness types provide a guarantee about what has been 
done with a value ‘in the past’. The distinction becomes clearer when we consider 
substitutions, which are generated by β-reductions. 

Substitutions are the same whether we are working with linear logic or 
uniqueness logic, as the rules for functions are identical, but the difference arises 
when thinking about what it is possible to know about a value in one logic com- 
pared to the other. Given a linear value, we know that substituting this value 
into an expression will preserve linearity, as there is no way to transform a linear 
value into a non-linear one. Conversely, given a unique expression then we know 
that any values substituted in will not affect the uniqueness guarantee, as there 
is no way to transform a non-unique value into a unique one. Thus ‘future’ refers 
to outgoing substitutions, while ‘past’ refers to incoming substitutions. 

So if linearity and uniqueness do in fact behave the same in some ways but 
not all, and they do in fact behave dually in some ways but not all, then what 
is the overall takeaway? What statement can we make about the relationship 
between their behaviour that reconciles these two viewpoints? 


Takeaway. Linearity and uniqueness behave dually with respect to composition, 
but identically with respect to structural rules, i.e., their internal plumbing. 


In other words, internally the non-linear and non-unique modalities are both 
comonoidal, so they allow for the same behaviour of contraction and weakening 
for values that are wrapped inside them. 

But the duality arises upon considering how we can map into and out of 
these modalities; we can map out of the non-linear modality and retrieve a linear 
value, but we can never map into it, giving the modality its familiar comonadic 
structure. Conversely, we can map a unique value into the non-unique modality 
to allow for contraction and weakening, but we can never map back out of it, 
which explains the dual monadic behaviour of this modality. 

It is this understanding of the similarities and differences between linearity 
and uniqueness that will allow us to unify them, and have values of both flavours 
present in a single type system, which will be our goal for the next section. 
3 The Linear-Cartesian-Unique Calculus 


We now consider how to represent both linearity and uniqueness in the same 
system. The first choice to make is whether our base values will be linear or 
unique,⁵ as this will influence the directionality of the modalities we need to 
include in the calculus. Here we present a system where linearity is the base and 
uniqueness is a modality, as opposed to one where uniqueness is the base and 
linearity is a modality, for two reasons. 


— The first reason is pragmatic; we later implement our approach in Granule, 
which already has linear values as the default. Therefore, including unique- 
ness as an additional modality in the system will require far fewer changes to 
the language, since a unique base would most likely require a redesign. More- 
over, languages with uniqueness types like Clean generally have non-unique 
values as their default, with uniqueness having to be specifically annotated; 
a system with a uniqueness modality will also map more closely onto these 
languages than one where uniqueness is the basis. 

— The second reason is that developing a sound calculus with a unique base 
is more complex. Consider such a hypothetical calculus with a modality ◦ 
representing unrestricted values and a modality • representing linear values. 
If we construct a product of linear values (a^•, b^•), then this product is unique 
(rather than linear), so we can promote to an unrestricted product (a^•, b^•)^◦ 
and freely duplicate the product, though the values contained within are 
linear. A linear base avoids this problem (among others) as products being 
linear by default means that their usage is maximally restricted, so there is 
no circumventing either a uniqueness guarantee via their construction or a 
linearity restriction via their duplication.⁶ 


Given a linear basis, we formalise the idea that we can map from unique to 
non-unique and from non-linear to linear. The key insight is that we treat non- 
linearity and non-uniqueness as the same state as both these states are un- 
restricted; we can do anything we like with, and have no guarantees for, an 
unrestricted value. We write *P for a P with a uniqueness guarantee, similar to 
the syntax of Clean and to avoid confusion with Harrington’s ◦ modality for 
non-uniqueness. The resulting calculus, which we call the Linear-Cartesian-Unique 
calculus (or LCU for short), builds on (intuitionistic multiplicative exponential) 
linear logic with additional rules for uniqueness. 


5 We choose a substructural basis over an unrestricted one since this more closely maps 
to both linear and uniqueness logic, where values have substructural behaviour by 
default unless they are wrapped in a modality. 

6 A similar problem arises from the application of unique functions, and this has been 
a thorn in the side of developers of uniqueness type systems for some time. The 
solution applied in Clean is that any function with unique elements in its closure is 
“necessarily unique”, meaning it cannot be subtyped into a non-unique function and 
applied multiple times. Handily, this coincides with the notion of a linear function, 
which is why our calculus having a linear base also avoids this problem. 




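Syntax LCU’s syntax is that of the linear λ-calculus with multiplicative products 
and unit (first line of syntax below) with terms for introducing and eliminating 
the ! modality and working with the uniqueness modality (second line): 


\[
\begin{array}{rcl}
t & ::= & x \mid \lambda x.t \mid t_1\,t_2 \mid (t_1, t_2) \mid \mathbf{let}\ (x,y) = t_1\ \mathbf{in}\ t_2 \mid \mathsf{unit} \mid \mathbf{let}\ \mathsf{unit} = t_1\ \mathbf{in}\ t_2 \\
  & \mid & {!}t \mid \mathbf{let}\ {!}x = t_1\ \mathbf{in}\ t_2 \mid \&t \mid \mathbf{copy}\ t_1\ \mathbf{as}\ x\ \mathbf{in}\ t_2 \mid {*}t \qquad \text{(terms)}
\end{array}
\]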


The meaning is explained in the next section with reference to typing. 


3.1 Typing 
Typing judgments are of the form Γ ⊢ t : A, with types A defined: 

\[
A, B ::= A \multimap B \mid A \otimes B \mid 1 \mid {!}A \mid {*}A \qquad \text{(types)}
\]


Thus our type syntax comprises linear function types A ⊸ B, linear 
multiplicative products A ⊗ B, a linear multiplicative unit 1, the non-linearity 
modality !A and the uniqueness modality *A. 

Typing contexts are defined as follows: 


\[
\Gamma ::= \emptyset \mid \Gamma, x : A \mid \Gamma, x : [A] \qquad \text{(contexts)}
\]


which are either empty, or contexts extended with a linear assignment x : A or 
contexts extended with a non-linear assignment denoted x : [A]. This marking 
of assumptions in a context as linear or non-linear (see Terui [50]) is one way to 
guarantee substitution is admissible (avoiding, for example, issues pointed out 
by Wadler where substitution is not well-typed if care is not taken [59,60], an 
issue noted also by Prawitz in 1965 in the context of S4 modal logic). 

Throughout, the comma operator , concatenates disjoint contexts. 

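We introduce the key typing rules inline. Figure 1 collects the full set of rules. 
The linear λ-calculus core is typed by the following three rules: 

\[
\frac{}{[\Gamma], x : A \vdash x : A}\;\textsc{var}
\qquad
\frac{\Gamma, x : A \vdash t : B}{\Gamma \vdash \lambda x.t : A \multimap B}\;\textsc{abs}
\qquad
\frac{\Gamma_1 \vdash t_1 : A \multimap B \quad \Gamma_2 \vdash t_2 : A}{\Gamma_1 + \Gamma_2 \vdash t_1\,t_2 : B}\;\textsc{app}
\]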


In the case of VAR, a linear variable is used but the rest of the context must be 
marked as non-linear, denoted by [Γ] which marks all assumptions as non-linear. 


Definition 1 (All non-linear assumptions). A context Γ is denoted as 
containing only non-linear assumptions by writing [Γ] in the typing rules, where [∅] 
and [Γ] ⟹ [Γ], x : [A]. 

In the case of APP, the two subterms are typed in different contexts which 
are then combined via context addition. 


Definition 2 (Context addition). The partial operation + on contexts is the 
union of two contexts as long as they are disjoint in their linear assumptions and 
any variables occurring in both contexts are both non-linear assumptions, i.e. 


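\[
\Gamma_1 + \Gamma_2 = \Gamma_1 \cup \Gamma_2
\quad\text{iff}\quad
\forall x \in \mathrm{dom}(\Gamma_1) \cap \mathrm{dom}(\Gamma_2) \implies \exists A.\ \Gamma_1(x) = \Gamma_2(x) = [A]
\]
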
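
Definition 2 can be read as a partial function on contexts. The following Rust sketch is one possible encoding, not the paper's implementation: the names `Assumption`, `Context`, `ctx` and `add_contexts` are hypothetical, types are kept abstract as strings, and partiality is modelled with `Option`.

```rust
use std::collections::BTreeMap;

// Types are kept abstract as strings; this is an illustrative encoding only.
#[derive(Clone, Debug, PartialEq)]
enum Assumption {
    Linear(String),    // x : A
    NonLinear(String), // x : [A]
}

type Context = BTreeMap<String, Assumption>;

// Small helper for building contexts from a list of assumptions.
fn ctx(entries: &[(&str, Assumption)]) -> Context {
    entries.iter().map(|(x, a)| (x.to_string(), a.clone())).collect()
}

// Partial context addition: defined only when every shared variable is a
// non-linear assumption of the same type in both contexts.
fn add_contexts(g1: &Context, g2: &Context) -> Option<Context> {
    let mut out = g1.clone();
    for (x, a2) in g2 {
        match (g1.get(x), a2) {
            (None, _) => {
                out.insert(x.clone(), a2.clone());
            }
            (Some(Assumption::NonLinear(t1)), Assumption::NonLinear(t2)) if t1 == t2 => {}
            _ => return None, // a shared linear variable, or mismatched types
        }
    }
    Some(out)
}

fn main() {
    let g1 = ctx(&[
        ("x", Assumption::NonLinear("A".into())),
        ("y", Assumption::Linear("B".into())),
    ]);
    let g2 = ctx(&[("x", Assumption::NonLinear("A".into()))]);
    println!("{:?}", add_contexts(&g1, &g2)); // defined: x is non-linear in both
    let g3 = ctx(&[("y", Assumption::Linear("B".into()))]);
    println!("{:?}", add_contexts(&g1, &g3)); // undefined: y is linear in both
}
```

Returning `None` for a shared linear variable is what prevents a rule such as APP from using the same linear assumption in both subterms.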
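The non-linear modality ! has the following introduction and elimination rules 
and related dereliction rule: 


\[
\frac{[\Gamma] \vdash t : A}{[\Gamma] \vdash {!}t : {!}A}\;{!}_I
\qquad
\frac{\Gamma_1 \vdash t_1 : {!}A \quad \Gamma_2, x : [A] \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{let}\ {!}x = t_1\ \mathbf{in}\ t_2 : B}\;{!}_E
\qquad
\frac{\Gamma, x : A \vdash t : B}{\Gamma, x : [A] \vdash t : B}\;\textsc{der}
\]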


The left-most rule captures the idea that a computation t of value A can be used 
non-linearly, by ‘promoting’ it to !A as long as all its inputs are also non-linear, 
denoted by [Γ] in the context. The middle rule eliminates a non-linear modality 
(a capability to use an A value non-linearly) by composing it with a variable x 
which is non-linear in t₂. These rules are accompanied by the ‘dereliction’ rule 
that says non-linear variables can be treated as linear variables. 

So far everything is standard from other linear type systems. We now move 
to our uniqueness modality which has two syntactic constructs: borrow and copy: 


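\[
\frac{\Gamma \vdash t : {*}A}{\Gamma \vdash \&t : {!}A}\;\textsc{borrow}
\qquad
\frac{\Gamma_1 \vdash t_1 : {!}A \quad \Gamma_2, x : {*}A \vdash t_2 : {!}B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{copy}\ t_1\ \mathbf{as}\ x\ \mathbf{in}\ t_2 : {!}B}\;\textsc{copy}
\]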


The borrow rule maps a unique value to a non-linear value, allowing a uniqueness 
guarantee to be forgotten. In terms of the operational semantics (see Section 3.4), 
this causes evaluation of t before the borrow. Next, the copy rule says that a 
non-linear value of type A can be copied to produce a unique A which is used by 
t₂; the input is required to be non-linear so that we cannot circumvent a linearity 
restriction by copying a linear value, and the output is required to be non-unique 
so that we cannot leverage the copy to smuggle out a value which pretends to 
be truly unique. These rules in turn are accompanied by the ‘necessitation’ rule 
that says values can be assumed unique as long as they have no dependencies: 


\[
\frac{\emptyset \vdash t : A}{[\Gamma] \vdash {*}t : {*}A}\;\textsc{nec}
\]


The borrow and copy rules in this logic suggest a monad-like relationship between 
the ! and * modalities, with the borrow rule representing the ‘return’ of the 
monad and the copy rule likewise acting as the ‘bind’. The * modality is not 
in itself a monad (or indeed, a comonad like !); rather, it acts as a functor over 
which the ! modality becomes a relative monad [3]. A relative monad comprises 
a functor J and an object mapping T, along with an operation η : JX → TX 
and a mapping from JX → TY arrows to TX → TY with axioms analogous to 
the monad axioms. Thus, here J is the uniqueness modality * and T the 
non-linearity modality !. If one imagines the dual version of this logic where the basis 
is unique, the hypothetical linearity modality would act as a functor making the 
non-uniqueness modality into a relative comonad in much the same way. 
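
The ‘return’ and ‘bind’ reading of borrow and copy can be illustrated with a small Rust model. The encoding is an assumption made here for illustration and does not come from the calculus itself: a unique value of type *A is modelled as `Box<A>` (a singly owned heap cell) and a non-linear value of type !A as `Rc<A>` (a freely shareable reference). The function names `borrow` and `copy` mirror the term constructs.

```rust
use std::rc::Rc;

// Assumed encoding (not from the paper): *A is modelled as Box<A>, a uniquely
// owned heap cell, and !A as Rc<A>, a freely shareable reference.

// borrow : *A -> !A. Forget the uniqueness guarantee (the "return").
fn borrow<A>(unique: Box<A>) -> Rc<A> {
    Rc::from(unique)
}

// copy : !A -> (*A -> !B) -> !B. Clone the shared value into a fresh unique
// cell, then run the continuation, whose result must be shareable (the "bind").
fn copy<A: Clone, B>(shared: &Rc<A>, k: impl FnOnce(Box<A>) -> Rc<B>) -> Rc<B> {
    k(Box::new((**shared).clone()))
}

fn main() {
    let shared: Rc<Vec<i32>> = Rc::new(vec![1, 2, 3]);
    let alias = Rc::clone(&shared); // duplication is fine for a shared value
    let updated = copy(&shared, |mut fresh| {
        fresh[0] = 10; // in-place write: `fresh` has exactly one owner
        borrow(fresh)
    });
    println!("{:?} {:?}", alias, updated);
}
```

In this model the write inside the continuation cannot be observed through `alias`, because the continuation only ever sees a fresh copy, and the result type `Rc<B>` prevents the fresh unique cell from escaping as if it were unique.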


3.2 Equational theory 


One way of understanding the meaning of the LCU calculus is to see its equa- 
tional theory (which we later prove sound against its operational model). The 
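\[
\frac{}{[\Gamma], x : A \vdash x : A}\;\textsc{var}
\qquad
\frac{\Gamma, x : A \vdash t : B}{\Gamma \vdash \lambda x.t : A \multimap B}\;\textsc{abs}
\qquad
\frac{\Gamma_1 \vdash t_1 : A \multimap B \quad \Gamma_2 \vdash t_2 : A}{\Gamma_1 + \Gamma_2 \vdash t_1\,t_2 : B}\;\textsc{app}
\]
\[
\frac{\Gamma_1 \vdash t_1 : A \quad \Gamma_2 \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash (t_1, t_2) : A \otimes B}\;\otimes_I
\qquad
\frac{\Gamma_1 \vdash t_1 : A \otimes B \quad \Gamma_2, x : A, y : B \vdash t_2 : C}{\Gamma_1 + \Gamma_2 \vdash \mathbf{let}\ (x,y) = t_1\ \mathbf{in}\ t_2 : C}\;\otimes_E
\]
\[
\frac{}{[\Gamma] \vdash \mathsf{unit} : 1}\;1_I
\qquad
\frac{\Gamma_1 \vdash t_1 : 1 \quad \Gamma_2 \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{let}\ \mathsf{unit} = t_1\ \mathbf{in}\ t_2 : B}\;1_E
\]
\[
\frac{\Gamma, x : A \vdash t : B}{\Gamma, x : [A] \vdash t : B}\;\textsc{der}
\qquad
\frac{[\Gamma] \vdash t : A}{[\Gamma] \vdash {!}t : {!}A}\;{!}_I
\qquad
\frac{\Gamma_1 \vdash t_1 : {!}A \quad \Gamma_2, x : [A] \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{let}\ {!}x = t_1\ \mathbf{in}\ t_2 : B}\;{!}_E
\]
\[
\frac{\Gamma \vdash t : {*}A}{\Gamma \vdash \&t : {!}A}\;\textsc{borrow}
\qquad
\frac{\Gamma_1 \vdash t_1 : {!}A \quad \Gamma_2, x : {*}A \vdash t_2 : {!}B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{copy}\ t_1\ \mathbf{as}\ x\ \mathbf{in}\ t_2 : {!}B}\;\textsc{copy}
\qquad
\frac{\emptyset \vdash t : A}{[\Gamma] \vdash {*}t : {*}A}\;\textsc{nec}
\]


Fig. 1: Collected typing rules for LCU calculus 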


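calculus has the standard βη-equalities for the multiplicative linear λ-calculus 
fragment, which includes the following βη rules for !: 


\[
\begin{array}{rcll}
\mathbf{let}\ {!}x = {!}t\ \mathbf{in}\ t' & \equiv & [t/x]t' & (\beta{!}) \\
\mathbf{let}\ {!}x = t\ \mathbf{in}\ {!}x & \equiv & t & (\eta{!})
\end{array}
\]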


along with the following equalities on the uniqueness fragment: 


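\[
\begin{array}{rcll}
\mathbf{copy}\ t\ \mathbf{as}\ x\ \mathbf{in}\ \&x & \equiv & t & (\mathit{unitR}) \\
\mathbf{copy}\ \&v\ \mathbf{as}\ x\ \mathbf{in}\ t' & \equiv & [v/x]t' & (\mathit{unitL}) \\
\mathbf{copy}\ t_1\ \mathbf{as}\ x\ \mathbf{in}\ (\mathbf{copy}\ t_2\ \mathbf{as}\ y\ \mathbf{in}\ t_3) & \equiv & \mathbf{copy}\ (\mathbf{copy}\ t_1\ \mathbf{as}\ x\ \mathbf{in}\ t_2)\ \mathbf{as}\ y\ \mathbf{in}\ t_3 & (\mathit{assoc})
\end{array}
\]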


The first axiom states that copying a non-linear t into a unique value x and 
immediately borrowing it to be non-linear is equivalent to just t. The second 
axiom states that borrowing a unique value v and copying it to a unique x in 
the scope of t’ is the same as just substituting in that v for x. The last axiom 
gives associativity of copying under the side condition that x is not free in t₃ 
(on the right-hand side, t₃ lies outside the scope of x). These 
equations are exactly the relative monad axioms [3], though we specialise (unitL) 
slightly by restricting to values to account for the reduction semantics. 

The typability of these axioms relies on the admissibility of linear and non- 
linear substitution shown in Section 3.5 on the metatheory of the calculus. 


3.3 Exploiting uniqueness for mutation 


A key use for ensuring uniqueness of a reference is that this allows mutation 
to be used safely—the original pun behind Wadler’s “Linear Types can Change 
the World” [57]. To illustrate this idea, and consider its soundness in the next 
section, we extend the LCU calculus with a primitive type of arrays: 


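\[
A ::= \ldots \mid \mathsf{Array}\ A \mid \mathbb{N} \mid \mathbb{F}
\]

where ℕ are natural numbers used for sizes and indices and 𝔽 floating-point 
values. The calculus is also extended with operations for floating-point arrays, 
typed by axiomatic rules (with built-in weakening): 


\[
\begin{array}{l}
[\Gamma] \vdash \mathsf{newArray} : \mathbb{N} \multimap {*}(\mathsf{Array}\ \mathbb{F}) \\
[\Gamma] \vdash \mathsf{readArray} : {*}(\mathsf{Array}\ \mathbb{F}) \multimap \mathbb{N} \multimap \mathbb{F} \otimes {*}(\mathsf{Array}\ \mathbb{F}) \\
[\Gamma] \vdash \mathsf{writeArray} : {*}(\mathsf{Array}\ \mathbb{F}) \multimap \mathbb{N} \multimap \mathbb{F} \multimap {*}(\mathsf{Array}\ \mathbb{F}) \\
[\Gamma] \vdash \mathsf{deleteArray} : {*}(\mathsf{Array}\ \mathbb{F}) \multimap 1
\end{array}
\]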


These operations provide the interface for exploiting unique array references, 
where writeArray performs mutation as the type system guarantees that uniquely 
typed values have not been duplicated in the past (Section 3.5). We ignore out- 
of-bounds exceptions as this is an orthogonal issue, which could be solved using 
indexed types. We elide rules for typing numerical terms here. 
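
The shape of this interface can be sketched in Rust, again using ownership as a stand-in for the uniqueness modality. The type `UniqueArray` and the snake-case function names are hypothetical, and two details are assumptions the calculus does not fix: new arrays are zero-initialised, and an out-of-bounds index simply panics (the text ignores out-of-bounds behaviour).

```rust
// Hypothetical Rust rendering of the four primitives. An owned `UniqueArray`
// stands in for a unique array of floats; it deliberately does not implement
// Clone, so it cannot be duplicated.
struct UniqueArray(Vec<f64>);

// Assumption: fresh arrays are zero-initialised.
fn new_array(size: usize) -> UniqueArray {
    UniqueArray(vec![0.0; size])
}

// Returns the element together with the (still unique) array.
fn read_array(a: UniqueArray, i: usize) -> (f64, UniqueArray) {
    let v = a.0[i];
    (v, a)
}

// Mutates in place; sound because the caller gave up its only reference.
fn write_array(mut a: UniqueArray, i: usize, v: f64) -> UniqueArray {
    a.0[i] = v;
    a
}

// Consumes the array, releasing its storage.
fn delete_array(a: UniqueArray) {
    drop(a);
}

fn main() {
    let a = new_array(3);
    let a = write_array(a, 1, 2.5);
    let (v, a) = read_array(a, 1);
    println!("{}", v);
    delete_array(a);
}
```

Threading the array through each call, rather than passing a shared reference, is what lets every operation assume it holds the only reference.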

Our implementation in Section 4 replays these ideas in a practical setting. 
The next section gives the operational heap model for the calculus, where the 
semantics of mutation is made concrete. 


3.4 Operational heap model 


We define an operational model for the LCU calculus to make the meaning 
of uniqueness and linearity more concrete, and to prove that our type system 
enforces the desired properties. The semantics is call-by-name and resembles 
a small-step operational semantics but instead uses a notion of heaps, both to 
capture the idea of a memory reference to arrays as well as to give a way to 
track resource usage on program variables. We adapt the model of Choudhury 
et al. 1], which was used to track resource usage in a pure language with graded 
types. Our model applies this idea to a non-graded setting, extended to include 
reference counting for uniqueness. To prove that linearity and uniqueness are 
respected (soundness), the heap semantics incorporates some typing information 
in order to ease the theorem statements and proofs as shown in Section 3.5. 
Single-step reductions in the operational model are of the form: 


\[
H \vdash t \leadsto H' \vdash t' \mid \Gamma \mid \Delta \qquad \text{(single-step judgment form)}
\]


where H is the incoming heap which provides bindings to variables that appear 
in t and array allocations. The result of the reduction is a new term t' with an 
updated heap H', as well as two additional pieces of information: Γ gives us 
a ‘binding context’ recording the typing of any binders that were encountered 
(or ‘opened’) during reduction, and Δ gives us a ‘usage context’ containing an 
account of how variables were used. Usage contexts are defined as: 


\[
\Delta ::= \emptyset \mid \Delta, x : r \quad \text{(usage contexts)}
\qquad
r ::= 1 \mid \omega \quad \text{(usage/reference counter)}
\]


where r is a usage marker that says a variable was used either once (denoted 1) 
or used more than once (denoted ω). Usage has a preorder < where 1 < ω. 

We extend the syntax of terms with a value form a representing runtime 
array references to the heap. In order to account for their type, the syntax of 
contexts is extended to include assumptions a : Array A which are treated as 
a different syntactic category of variables. Additional runtime typing rules for 
array reference terms a are provided akin to a variable rule (see appendix [28]). 

Heaps are defined as follows akin to a context but containing two kinds of 
‘allocations’ for variables x and for array references a: 


\[
H ::= \emptyset \mid H, x \mapsto_r (\Gamma \vdash t : A) \mid H, a \mapsto_r \mathit{arr} \qquad \text{(heaps)}
\]


In the case of extending the heap with a variable allocation for x, the heap records 
that x can be used according to r and that it maps to a term t, along with its 
typing which is only present to aid the metatheory. For brevity, we sometimes 
write x ↦_r t instead of x ↦_r (Γ ⊢ t : A) when the typing is not important. In the 
case of an array reference a, the heap records the number of references currently 
held to it, where r is again used (representing either one reference 1 or many ω), 
and describes the heap-only array representation term arr pointed to by that 
reference (whose syntax we introduce later along with the relevant rules). 
Multiple reductions are composed from zero or more single-step reductions, 
with judgments of the form H F t > H'F t |T | A given by two rules 
capturing empty reduction sequences and extending a sequence at its head: 


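$$\dfrac{}{H \vdash t \Rightarrow H \vdash t \mid \emptyset \mid \emptyset}\;\textsc{refl}
\qquad
\dfrac{H \vdash t_1 \leadsto H' \vdash t_2 \mid \Gamma_1 \mid \Delta_1 \qquad H' \vdash t_2 \Rightarrow H'' \vdash t_3 \mid \Gamma_2 \mid \Delta_2}{H \vdash t_1 \Rightarrow H'' \vdash t_3 \mid \Gamma_1, \Gamma_2 \mid \Delta_1 + \Delta_2}\;\textsc{ext}$$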


In the case of EXT the binding contexts are disjoint (since we treat binders as 
unique in a standard way) but the usage contexts are added as follows: 


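$$\begin{array}{ll}
\Delta + \emptyset = \Delta \qquad \emptyset + \Delta = \Delta & \\
(\Delta_1, x : r) + \Delta_2 = (\Delta_1 + \Delta_2), x : r & \text{if } x \notin \mathrm{dom}(\Delta_2) \\
\Delta_1 + (\Delta_2, x : r) = (\Delta_1 + \Delta_2), x : r & \text{if } x \notin \mathrm{dom}(\Delta_1) \\
(\Delta_1, x : r) + (\Delta_2, x : s) = (\Delta_1 + \Delta_2), x : \omega &
\end{array}$$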


i.e., if a variable $x$ appears in both usage contexts then in the resulting context
$x : \omega$, since for the purposes of our counting we are interested in counting 0 uses
(via absence in $\Delta$), 1 use, or many uses ($\omega$).
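The addition of usage contexts can be illustrated with a short Python sketch. It is not part of the formal development: the dictionary encoding, the string markers `"1"` and `"w"` (standing for ω) and the name `add_usage` are assumptions made only for this illustration.

```python
from typing import Dict

ONE, MANY = "1", "w"  # "w" stands for the 'many' marker (omega)


def add_usage(d1: Dict[str, str], d2: Dict[str, str]) -> Dict[str, str]:
    """Add two usage contexts, encoded as dicts from variable names to markers."""
    result = dict(d1)
    for var, marker in d2.items():
        # A variable recorded on both sides has been used more than once.
        result[var] = MANY if var in result else marker
    return result


# x is used once on each side, so overall it is used many times;
# y is absent on the left (zero uses there) and keeps its marker.
combined = add_usage({"x": ONE}, {"x": ONE, "y": ONE})
```

Absence from the dictionary plays the role of zero uses, so no explicit marker for 0 is needed in this encoding.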


Heap model The reduction rules are collected in the appendix [28], but we 
explain the core rules for the single-step reduction relation here. Unlike a normal 
small-step semantics, variables have a reduction, with two possibilities: 


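$$\dfrac{}{H, x \mapsto_1 t \vdash x \leadsto H \vdash t \mid \emptyset \mid x : 1}\;\leadsto_{\textsc{var}1}
\qquad
\dfrac{}{H, x \mapsto_\omega t \vdash x \leadsto H, x \mapsto_\omega t \vdash t \mid \emptyset \mid x : 1}\;\leadsto_{\textsc{var}\omega}$$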


Both reduce a variable x to the term t which is assigned to x in the heap. 
In the left rule, we started out with a heap capability of $1$ (linear) so after the
reduction we remove $x$ from the heap. In the right rule, we have a heap capability
of $\omega$ (non-linear) so we preserve the assignment to $x$ in the outgoing heap.
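As a rough illustration of the two variable rules, the following Python sketch performs one reduction step on a variable. The heap encoding (a dictionary from variable names to a capability and a term, with terms as plain strings) and the name `step_var` are hypothetical and chosen only for this example; typing information is omitted.

```python
from typing import Dict, Tuple

Heap = Dict[str, Tuple[str, str]]  # variable -> (capability "1" or "w", term)


def step_var(heap: Heap, x: str):
    """One reduction step for the variable x.

    Returns (new_heap, term, binding_context, usage_context)."""
    capability, term = heap[x]
    if capability == "1":
        # Linear binding: consumed by this use, so it leaves the heap.
        new_heap = {y: b for y, b in heap.items() if y != x}
    else:
        # Non-linear binding: stays available for later uses.
        new_heap = dict(heap)
    # No binders are opened, and x is recorded as used once in this step.
    return new_heap, term, {}, {x: "1"}


heap = {"x": ("1", "unit"), "y": ("w", "unit")}
heap_after_x, term_x, _, usage_x = step_var(heap, "x")
heap_after_y, term_y, _, usage_y = step_var(heap, "y")
```

In both cases the usage context records a single use; repeated uses of a non-linear variable only become ‘many’ once the usage contexts of successive steps are added together.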

$\beta$-reduction is then given as follows:


$$\dfrac{\Gamma \vdash t' : A}{H \vdash (\lambda x.t)\, t' \leadsto H, x \mapsto_1 (\Gamma \vdash t' : A) \vdash t \mid x : A \mid \emptyset}\;\leadsto_\beta$$

Rather than using a substitution, the body term is the result under a heap
extended with $x$ assigned to the (typed) argument term $t'$. This heap binding
is given a resource capability of $1$ since functions are linear. In the output, we
remember that a linear binding has been opened up in the scope of the term.
An inductive rule allows an application to reduce on the left: 


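$$\dfrac{H \vdash t_1 \leadsto H' \vdash t_1' \mid \Gamma \mid \Delta}{H \vdash t_1\, t_2 \leadsto H' \vdash t_1'\, t_2 \mid \Gamma \mid \Delta}\;\leadsto_{\textsc{app}}$$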


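We elide the rules for products and unit which follow much the same scheme: one
congruence to evaluate the reduction of an elimination form and one to enact a
$\beta$-reduction. For the $!$ modality, this scheme gives us the $!\beta$ rule which creates a
non-linear binding of $x$ to the term $t_1$:


$$\dfrac{[\Gamma] \vdash t_1 : A}{H \vdash \mathbf{let}\ {!x} = {!t_1}\ \mathbf{in}\ t_2 \leadsto H, x \mapsto_\omega ([\Gamma] \vdash t_1 : A) \vdash t_2 \mid x : [A] \mid \emptyset}\;\leadsto_{!\beta}$$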


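The more interesting rules are for the uniqueness aspects of the language. Borrowing
$\&$ (which maps a unique value type $*A$ to a non-linear value $!A$) has a
congruence rule and a reduction to enact a borrow:


$$\dfrac{H \vdash t \leadsto H' \vdash t' \mid \Gamma \mid \Delta}{H \vdash \&t \leadsto H' \vdash \&t' \mid \Gamma \mid \Delta}\;\leadsto_{\&}
\qquad
\dfrac{\mathrm{dom}(H) = \mathrm{arrRefs}(v)}{H, H' \vdash \&(*v) \leadsto [H]_\omega, H' \vdash {!v} \mid \emptyset \mid \emptyset}\;\leadsto_{\&\beta}$$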


The action is in the right-hand rule here, where the incoming heap is split into
two parts, where $H$ is such that it provides the allocations for all array references
in $v$ (enforced by the premise here). The unique value $*v$ is wrapped to be non-linear
in the result $!v$ and thus all of its array references are now marked as
‘many’ via $[H]_\omega$, which replaces all reference counts with $\omega$, e.g.:


$$H', a \mapsto_r \mathsf{arr} \vdash \&(*a) \leadsto H', a \mapsto_\omega \mathsf{arr} \vdash {!a} \mid \emptyset \mid \emptyset$$


Thus, borrowing enacts the idea that a reference is no longer unique and may be
used many times (and hence now is a non-linear value). Copying then has three
reductions: a congruence (elided), a reduction which forces evaluation under the
non-linear modality, and a $\beta$-reduction to enact copying to a unique value:


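$$\dfrac{H \vdash t_1 \leadsto H' \vdash t_1' \mid \Gamma \mid \Delta}{H \vdash \mathbf{copy}\ {!t_1}\ \mathbf{as}\ x\ \mathbf{in}\ t_2 \leadsto H' \vdash \mathbf{copy}\ {!t_1'}\ \mathbf{as}\ x\ \mathbf{in}\ t_2 \mid \Gamma \mid \Delta}\;\leadsto_{\mathrm{copy}!}$$


$$\dfrac{[\Gamma] \vdash v : A \qquad \mathrm{dom}(H') = \mathrm{arrRefs}(v) \qquad (H'', \theta) = \mathrm{copy}(H')}{H, H' \vdash \mathbf{copy}\ {!v}\ \mathbf{as}\ x\ \mathbf{in}\ t \leadsto H, H', H'', x \mapsto_1 (\Gamma \vdash *\theta(v) : *A) \vdash t \mid x : *A \mid \emptyset}\;\leadsto_{\mathrm{copy}\beta}$$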


The $\leadsto_{\mathrm{copy}!}$ rule evaluates under $!$ so that the first term can be reduced to a
value $v$ to be copied in the next rule. The $\leadsto_{\mathrm{copy}\beta}$ rule enacts copying where
$\mathrm{dom}(H') = \mathrm{arrRefs}(v)$ marks the part of the heap with array references coming
from $v$. Then $\mathrm{copy}(H')$ copies the arrays in this part of the heap, creating a heap
fragment $H''$ and a renaming $\theta$ which maps from the old array references to the
references of the new copies. This renaming is applied to $v$ in the newly bound
unique variable $x$. Thus the value $\theta(v)$ refers to any freshly copied arrays.

Lastly, the semantics of the array primitives uses an array representation on
the heap, where $\mathsf{arr}$ is some array object and $\mathsf{arr}[i] = v$ indicates that the $i$th
element is bound to the value $v$, and we write $a \# H$ for an array reference $a$
which is fresh for heap $H$:


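$$\dfrac{a \# H}{H \vdash \mathsf{newArray}\ n \leadsto H, a \mapsto_1 \mathsf{arr} \vdash *a \mid \emptyset \mid \emptyset}$$

$$H, a \mapsto_r (\mathsf{arr}[i] = v) \vdash \mathsf{readArray}\ (*a)\ i \leadsto H, a \mapsto_r (\mathsf{arr}[i] = v) \vdash (v, *a) \mid \emptyset \mid \emptyset$$

$$H, a \mapsto_r \mathsf{arr} \vdash \mathsf{writeArray}\ (*a)\ i\ v \leadsto H, a \mapsto_r (\mathsf{arr}[i] = v) \vdash *a \mid \emptyset \mid \emptyset$$

$$H, a \mapsto_r \mathsf{arr} \vdash \mathsf{deleteArray}\ (*a) \leadsto H \vdash \mathsf{unit} \mid \emptyset \mid \emptyset$$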


Thus newArray creates a fresh array reference a and allocates a new array on the 
heap with a single reference count. The readArray and writeArray primitives work 
as expected to read and destructively update the array referenced by a, whose 
reference count is arbitrary but unchanged by the reduction. Lastly deleteArray 
deallocates the array. Noticeably, the rules do not enforce uniqueness; but as we 
see in the next section, well-typed programs preserve uniqueness of references. 
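The heap behaviour of the array primitives, together with borrowing and copying, can be approximated by the following Python sketch. It is an illustration under assumptions rather than the model itself: the function names, the string reference counts `"1"` and `"w"` (for ω), and zero-initialisation of new arrays are all choices made here. As in the rules, nothing in the sketch enforces uniqueness; that is the job of the type system.

```python
import itertools
from typing import Dict, List, Tuple

Heap = Dict[str, Tuple[str, List[float]]]  # reference -> (count "1" or "w", contents)
_fresh = itertools.count()


def new_array(heap: Heap, n: int) -> str:
    a = f"a{next(_fresh)}"            # a fresh reference (a # H)
    heap[a] = ("1", [0.0] * n)        # zero-initialisation is an assumption
    return a


def read_array(heap: Heap, a: str, i: int):
    _, arr = heap[a]
    return arr[i], a                  # the reference is handed back with the value


def write_array(heap: Heap, a: str, i: int, v: float) -> str:
    _, arr = heap[a]
    arr[i] = v                        # destructive, in-place update
    return a


def delete_array(heap: Heap, a: str) -> None:
    del heap[a]                       # deallocate the array


def borrow(heap: Heap, refs: List[str]) -> None:
    for a in refs:                    # every borrowed count becomes 'many'
        heap[a] = ("w", heap[a][1])


def copy_arrays(heap: Heap, refs: List[str]) -> Dict[str, str]:
    theta = {}                        # renaming from old to new references
    for a in refs:
        b = f"a{next(_fresh)}"
        heap[b] = ("1", list(heap[a][1]))   # the copy is unique again
        theta[a] = b
    return theta


heap: Heap = {}
a = new_array(heap, 3)
write_array(heap, a, 1, 2.5)
borrow(heap, [a])                     # a is now shared
b = copy_arrays(heap, [a])[a]         # b names a unique copy
write_array(heap, b, 0, 9.0)          # does not affect a
delete_array(heap, b)
```

After this sequence only the borrowed array remains on the heap, marked as shared and untouched by the write to its copy.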


3.5 Metatheory 


Proofs of all the statements that follow are provided in the appendix [28]. We 
first establish some key results showing the admissibility of substitution and 
weakening, which are leveraged in later proofs: 


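Lemma 1 (Linear substitution). If $\Gamma' \vdash t' : A$ and $\Gamma, x : A \vdash t : B$ then
$\Gamma + \Gamma' \vdash [t'/x]t : B$.

Lemma 2 (Non-linear substitution). If $[\Gamma'] \vdash t' : A$ and $\Gamma, x : [A] \vdash t : B$
then $\Gamma + [\Gamma'] \vdash [t'/x]t : B$.


Lemma 3 (Weakening is admissible). If $\Gamma \vdash t : A$ then $\Gamma, [\Gamma'] \vdash t : A$.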


Next, the heap model allows us to establish the key properties of well-typed 
programs respecting linearity and uniqueness restrictions. We first define when 
a heap is compatible with a typing context: 

Definition 3 (Heap-context compatibility). A heap $H$ is compatible with
a typing context $\Gamma$ if $H$ contains assignments for every variable in the context
and the typing contexts of the terms in the heap are also compatible with the
heap. The relation is defined inductively as:


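$$\dfrac{}{\emptyset \bowtie \emptyset}\;\textsc{empty}
\qquad
\dfrac{H \bowtie \Gamma}{H, a \mapsto_r \mathsf{arr} \bowtie \Gamma, a : \mathsf{Array}\ A}\;\textsc{ref}
\qquad
\dfrac{H \bowtie (\Gamma_1 + \Gamma_2) \qquad \Gamma_2 \vdash t : A \qquad x \notin \mathrm{dom}(H)}{(H, x \mapsto_r (\Gamma_2 \vdash t : A)) \bowtie (\Gamma_1, x : A)}\;\textsc{lin}$$

$$\dfrac{H \bowtie (\Gamma_1 + [\Gamma_2]) \qquad [\Gamma_2] \vdash t : A \qquad x \notin \mathrm{dom}(H)}{(H, x \mapsto_\omega ([\Gamma_2] \vdash t : A)) \bowtie (\Gamma_1, x : [A])}\;\omega$$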


362 D. Marshall et al. 


Thus, a heap compatible with $\Gamma_1, x : A$ contains an assignment for $x$ marked with
a usage annotation $r$ which can be either $1$ for linear or $\omega$ for non-linear use. Note
that non-linear values can be used linearly, as captured by dereliction (the DER
typing rule). However, a non-linear assumption must have a heap assignment
marked with $\omega$ (rule $\omega$), where the dependencies of the assigned term $t$ must all
be non-linear in the remaining compatibility judgment on the rest of the heap.
From a heap (and likewise from a typing context) we can also extract usage 
information. This is useful for focusing on resource usage as follows: 


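Definition 4 (Usage context extraction). For a context $\Gamma$ or heap $H$ we
can extract usage information denoted $\overline{\Gamma}$ or $\overline{H}$ defined as:


$$\overline{\emptyset} = \emptyset \qquad \overline{(\Gamma, x : [A])} = \overline{\Gamma}, x : \omega \qquad \overline{(\Gamma, a : A)} = \overline{\Gamma} \qquad \overline{(\Gamma, x : A)} = \overline{\Gamma}, x : 1$$

$$\overline{\emptyset} = \emptyset \qquad \overline{(H, x \mapsto_r (\Gamma \vdash t : A))} = \overline{H}, x : r \qquad \overline{(H, a \mapsto_r t)} = \overline{H}$$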


We now give the two main theorems about our calculus which give us the properties
that linearity is respected (called conservation, Theorem 4) and that uniqueness
is respected (Theorem 5).


Theorem 4 (Conservation). For a well-typed term $\Gamma \vdash t : A$ and all $\Gamma_0$ and
$H$ such that $H \bowtie (\Gamma_0 + \Gamma)$ and a reduction $H \vdash t \leadsto H' \vdash t' \mid \Gamma_1 \mid \Delta$ we have:


$$\exists \Gamma'.\ \ \Gamma' \vdash t' : A \ \wedge\ H' \bowtie (\Gamma_0 + \Gamma') \ \wedge\ (\overline{H'} + \Delta) \sqsubseteq (\overline{H} + \overline{\Gamma_1})$$


The first conjunct is regular type preservation, linked with heap compatibility in
the second conjunct. The last conjunct expresses the core of conservation: that
resource usage accrued in this reduction, given by $\Delta$, plus remaining resources
given in the heap $H'$ are approximated (via $\sqsubseteq$, the pointwise lifting of $\leq$) by the
original resources given in the heap $H$ plus the specification of the resources from
any variable bindings $\Gamma_1$ encountered along the way. The context $\Gamma_0$ accounts
for bindings not described by $\Gamma$, and is key to the inductive proof of this result.
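The resource-accounting conjunct of conservation can be checked by hand on small reductions. The Python sketch below does so for the two steps of reducing the identity function applied to unit. The encoding is hypothetical: heaps map variables to a marker and a term, opened binders map variables to a flag saying whether they are non-linear, `"w"` stands for ω, and the approximation check assumes both sides mention the same variables.

```python
ONE, MANY = "1", "w"   # "w" stands for the 'many' marker (omega)


def add(d1, d2):
    """Usage-context addition: a variable on both sides becomes MANY."""
    out = dict(d1)
    for x, r in d2.items():
        out[x] = MANY if x in out else r
    return out


def usage_of_heap(heap):
    """heap maps variables to (marker, term); keep only the markers."""
    return {x: r for x, (r, _term) in heap.items()}


def usage_of_bindings(bindings):
    """bindings maps variables to True if non-linear, False if linear."""
    return {x: (MANY if nonlinear else ONE) for x, nonlinear in bindings.items()}


def approximated_by(d1, d2):
    """Pointwise check: equal markers, or ONE approximated by MANY."""
    return d1.keys() == d2.keys() and all(
        d1[x] == d2[x] or d2[x] == MANY for x in d1)


def conserves(heap_before, heap_after, opened, used):
    """Resources left in the heap plus those used, against those available."""
    lhs = add(usage_of_heap(heap_after), used)
    rhs = add(usage_of_heap(heap_before), usage_of_bindings(opened))
    return approximated_by(lhs, rhs)


# (\x. x) unit: the beta step opens a linear binder for x ...
step1 = conserves({}, {"x": (ONE, "unit")}, {"x": False}, {})
# ... and the variable step then consumes it.
step2 = conserves({"x": (ONE, "unit")}, {}, {}, {"x": ONE})
# A step that used x but left its linear binding in the heap would not conserve.
bogus = conserves({"x": (ONE, "unit")}, {"x": (ONE, "unit")}, {}, {"x": ONE})
```

The last case shows what the inequality rules out: a linear binding that is used and yet still available afterwards would count as ‘many’ on the left but only as one on the right.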

We then establish that all heap references have only one reference to them 
at the end of execution. 


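Theorem 5 (Uniqueness). For a well-typed term $\Gamma \vdash t : *A$ and all $\Gamma_0$ and
$H$ such that $H \bowtie (\Gamma_0 + \Gamma)$ and given a multi-reduction to a value
$H \vdash t \Rightarrow H' \vdash *v \mid \Gamma' \mid \Delta$, for all $a \in \mathrm{arrRefs}(v)$ (array references in $v$) we have:


$$(a \mapsto_1 \_ \in H \implies \exists t'.\ a \mapsto_1 t' \in H') \ \wedge\ (a \notin \mathrm{dom}(H) \implies \exists t'.\ a \mapsto_1 t' \in H')$$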


i.e., any array references contributing to the final term that are unique in the 
incoming heap stay unique in the resulting term, and any new array references 
contributing to the final term are also unique. 


Notice that there is a certain duality between the conservation theorem and 
the uniqueness theorem which mirrors the weak duality between linearity and 
uniqueness. The statement of conservation is a generalised way to say that if a 
variable is linear then it will always be used in a linear way, or in other words that
linearity restrictions will always be upheld; conversely, the uniqueness theorem
tells us that if a variable is unique then it must always have been used in a 
unique way, or in other words that it does not have multiple references. 

One important point to notice is that the additional rules (borrow and copy) 
that we include for unique types are in fact trivial cases when it comes to the 
uniqueness theorem since they can never output a value with a unique type. 
This makes sense as the idea behind these additional rules is to mediate the 
interaction between uniqueness and non-uniqueness, and this interaction can 
only ever go in the direction of producing values that are non-unique. 

A sub-result of conservation is type preservation which is complemented by
a separate progress result in Theorem 6 to give syntactic type safety:


Theorem 6 (Progress). Values of the heap model $v$ are given by:

$$v ::= (t_1, t_2) \mid \mathsf{unit} \mid *t \mid {!t} \mid \lambda x.t \mid n \mid a \mid p \quad \text{(value terms sub-grammar)}$$


where $p$ are partially-applied primitives, e.g., newArray, readArray, readArray (*a).
Given $\Gamma \vdash t : A$, then $t$ is either a value, or if $H \bowtie (\Gamma_0 + \Gamma)$ there exists a heap $H'$,
term $t'$, usage context $\Delta$, and context $\Gamma'$ such that $H \vdash t \leadsto H' \vdash t' \mid \Gamma' \mid \Delta$.


Finally, we see that the operational semantics, extended to full $\beta$-reduction
(i.e., all congruences), supports the equational theory:


Theorem 7 (Soundness with respect to the equational theory). For all
$t_1, t_2$ such that $\Gamma \vdash t_1 : A$ and $\Gamma \vdash t_2 : A$ and $t_1 \equiv t_2$ and given $H$ such that
$H \bowtie \Gamma$, there exists a value (irreducible term) $v$ and $\Gamma_1, \Gamma_2, \Delta_1, \Delta_2$ such that
there are full $\beta$-reductions to the same value:


$$H \vdash t_1 \Rightarrow_\beta H' \vdash v \mid \Gamma_1 \mid \Delta_1 \ \wedge\ H \vdash t_2 \Rightarrow_\beta H' \vdash v \mid \Gamma_2 \mid \Delta_2$$


4 Implementation 


4.1 Frontend 


The implementation of uniqueness types in Granule follows much the same pattern
as the logic defined earlier. Granule already possesses a semiring graded
necessity modality, where for a pre-ordered semiring $(\mathcal{R}, *, 1, +, 0, \sqsubseteq)$, there is a
family of types $\{\Box_r A\}_{r \in \mathcal{R}}$. We represent the $!$ from linear logic (and our calculus)
via the pre-ordered semiring $\{0, 1, \omega\}$ (none-one-tons [30]) with $!A = \Box_\omega A$.⁷
The semiring is defined with $r + s = r$ if $s = 0$, $r + s = s$ if $r = 0$, and otherwise
$\omega$; and $r * 0 = 0 * r = 0$, $r * \omega = \omega * r = \omega$ (for $r \neq 0$), and $r * 1 = 1 * r = r$;
with ordering $0 \sqsubseteq \omega$ and $1 \sqsubseteq \omega$. This semiring allows us to represent both linear
and non-linear use: variables graded with $1$ must be used linearly, with $0$ must
be discarded, and a grade of $\omega$ permits unconstrained use à la linear logic’s $!$.
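For readers who prefer to see the operations spelled out, the none-one-tons semiring can be written down directly. The following Python sketch is only an illustration: the string encoding (with `"w"` for ω) and the function names `plus`, `times` and `leq` are assumptions, not Granule's internal representation.

```python
ZERO, ONE, MANY = "0", "1", "w"   # "w" stands for omega


def plus(r: str, s: str) -> str:
    """Addition: 0 is the unit, and any other sum is 'many'."""
    if s == ZERO:
        return r
    if r == ZERO:
        return s
    return MANY


def times(r: str, s: str) -> str:
    """Multiplication: 0 annihilates, 1 is the unit, omega absorbs the rest."""
    if r == ZERO or s == ZERO:
        return ZERO
    if r == MANY or s == MANY:
        return MANY
    return ONE


def leq(r: str, s: str) -> bool:
    """The pre-order: reflexive, with 0 and 1 both below omega."""
    return r == s or s == MANY


two_uses = plus(ONE, ONE)          # using a variable twice needs grade omega
scaled = times(MANY, ONE)          # a linear use under an unrestricted one
```

Note that 0 and 1 are incomparable in this ordering, so a variable graded 1 cannot be silently discarded and a variable graded 0 cannot be used.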


7 It may not seem obvious that such a graded modality does exactly represent the
behaviour of linear logic’s !, and in fact capturing the precise behaviour of ! does 
require some additional semiring structure which is present in Granule [22]. 




[Figure 2 (diagram). Node labels: Unique (*a); (non-linear) (non-unique), a [0..∞]; Affine, a [0..1]; Relevant, a [1..∞]; Linear, a. Edge labels: ‘forget guarantee of no contraction’; ‘weaker’.]

Fig. 2: Relationship between various flavours of substructural types demonstrat- 
ing how they can all be represented using Granule’s expressive modalities. 


(In Granule, $\Box_\omega A$ can be written as the type A [Many], but we syntactically
alias this to !A for simplicity and ease of understanding.) 


As in LCU, uniqueness is represented by a new modality, which we call * to
match the calculus (and so that the syntax of programs involving uniqueness will 
be familiar to Clean users). The uniqueness modality wraps a value that behaves 
‘linearly’ (and so cannot be duplicated or discarded), with the key difference 
being that we provide primitive functions which allow ! to act as a relative 
monad over unique values. The primitives have the following type signatures: 


uniqueReturn : ∀ {a : Type} . *a → !a               -- borrow
uniqueBind   : ∀ {a b : Type} . (*a → !b) → !a → !b -- copy
Granule 


The uniqueReturn function here implements the BORROW rule from the calcu- 
lus (acting as the ‘return’ of the relative monad), and similarly the uniqueBind 
function implements the copy rule (acting as the ‘bind’). 


We provide syntactic sugar for both of these primitives for convenience, with
syntax designed to evoke the rules from the LCU calculus; &x is equivalent
to writing uniqueReturn x, while clone t1 as x in t2 is equivalent to writing
uniqueBind (λx → t2) t1.⁸ A simple example of uniqueness types in action is
given below, to demonstrate the idea.


sip : *Coffee → (Coffee, Awake)
sip fresh = let !coffee = &fresh in (keep coffee, drink coffee)


Granule 


Here, borrowing (&) converts the unique Coffee value into an unrestricted one, 
so that it can be duplicated and used twice for the two separate functions. Note 
however that the uniqueness guarantee is lost in the process, so both of the 
output values are non-unique (linear, in this case). 


Figure 2 illustrates the relationship between uniqueness, linearity and other
common forms of substructural typing in the resulting system. 


8 In the implementation we use ‘clone’ rather than ‘copy’, as the name ‘copy’ is often 
used elsewhere in Granule, e.g. for the non-linear function which duplicates its input.

We implemented a built-in library for primitive floating point arrays in Granule,
matching the interface for arrays of floats that was introduced as an extension
to the LCU calculus in Section 3.3, with operations typed as follows:


newFloatArray : Int → *FloatArray
readFloatArray : *FloatArray → Int → (Float, *FloatArray)
writeFloatArray : *FloatArray → Int → Float → *FloatArray
deleteFloatArray : *FloatArray → ()
Granule 
The writeFloatArray primitive updates an array destructively in place since we 
have a guarantee that no other references exist to the array which has been passed 
in. In the next section, we use this set of primitives to evaluate the performance 
of our implementation, by measuring the performance gains from allowing for 
in-place updates in this fashion. We have another set of immutable primitives 
akin to the above (but written with a suffix I) which work with non-unique 
arrays, e.g. readFloatArrayI : FloatArray —> Int —> (Float, FloatArray), and 
thus do not perform mutation. 
The following shows an example of clone, where a new array is borrowed and 
a copy of this borrowed FloatArray on line 3 is deleted, leaving the original (now 
immutable) instance of the array unaffected on line 4: 


1 let x = newFloatArray 10 in
2 let [y] = &x in
3 let [()] = clone [y] as y' in (let () = deleteFloatArray y' in [()])
4 in readFloatArrayI y 10


Granule 


4.2 Compilation and Evaluation 


As part of our implementation of uniqueness types in Granule, as described in
the previous section, we also implemented a simple compiler that translates programs
into Haskell. This compiler preserves the value types, but erases all of Gran- 
ule’s substructural types (linear, unique, graded, etc.). As a result, we can take 
advantage of both Granule’s flexible type system and Haskell’s libraries and 
optimizing compiler. For this paper, all performance results were measured by 
compiling Granule programs to Haskell, and compiling the resulting Haskell with 
GHC 9.0.1. The measurements were collected on an ordinary MacBook with a 
2 GHz quad-core Intel i5 processor and 16 GB of RAM. 

As mentioned in Section 1, one motivation for using uniqueness types is to do
the kind of in-place mutation necessary for efficient programming with arrays. 
To check that our implementation is reasonable, we carried out an evaluation 
using an array processing benchmark. The benchmark recursively allocates and 
sums up lists of arrays of various sizes, with the goal of demonstrating the 
benefits of uniqueness types for arrays in functional programming. Each iteration 
of the benchmark allocates a list of a thousand arrays, populates the arrays with 
values, then traverses the list to sum them up. We prepared two versions of this
benchmark: one with functional in-place updates and manual (safe) deletion of
unique arrays, and one with non-unique, immutable, garbage collected arrays
and updates via copying. The overall performance of these two benchmarks is
shown in Figure 3, with lower bars/numbers representing better performance.
The results, while not surprising, do confirm that array-handling is generally
more efficient when in-place mutation is allowed. Additionally, in Figure 3 we
compared the time spent in garbage collection between the two versions of the
benchmark. Because our implementation allocates unique data outside of GHC’s
heap, and uniqueness types allow programmers to directly de-allocate objects in
memory, the unique version of the benchmark spends significantly less time in
garbage collection. For this benchmark, the unique arrays are outside of the
garbage collected heap and directly de-allocated, while other incidental objects
(closures, lists, and so on) are still handled by the garbage collector.

Fig. 3: Performance of mutable vs. immutable arrays in Granule. Lower is better.
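The shape of the two benchmark variants can be conveyed by a small Python analogue. This is not the Granule benchmark code, and it makes no claim about timings: the function names and sizes are hypothetical, and the sketch only contrasts filling an array by destructive update with filling it by building a fresh copy on every write.

```python
def fill_in_place(n: int) -> list:
    arr = [0.0] * n
    for i in range(n):
        arr[i] = float(i)             # destructive update, no new array
    return arr


def fill_by_copy(n: int) -> tuple:
    arr = (0.0,) * n                  # immutable array
    for i in range(n):
        # Each 'write' builds a whole new array and abandons the old one.
        arr = arr[:i] + (float(i),) + arr[i + 1:]
    return arr


def benchmark_iteration(fill, num_arrays: int, size: int) -> float:
    arrays = [fill(size) for _ in range(num_arrays)]   # allocate and populate
    return sum(sum(a) for a in arrays)                 # traverse and sum


total_unique = benchmark_iteration(fill_in_place, 10, 8)
total_immutable = benchmark_iteration(fill_by_copy, 10, 8)
```

Both variants compute the same total; they differ only in how many intermediate arrays are allocated along the way, which is the cost that in-place update avoids.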

Of course, this is a somewhat contrived benchmark. Real-world Haskell libraries,
for example, typically provide functional high-level interfaces for array
manipulation while using unsafe code to mutate arrays internally. The popular
vector library⁹ is one example, and repa is another. Additionally, there is
significant prior work on improving the efficiency of functional programs operating
on arrays (for example, using combinators like map and fold along with
aggressive fusion [13, 25, 27]), which we will not dwell on. The main point is that,
at some stage, arrays must be mutated. Rather than having this happen through 
unsafe code, or via external C or Fortran, uniqueness types give us a way to do 
that mutation directly in our functional language, efficiently and safely. 

Crucially, in these comparisons, all versions of the programs are implemented 
in the same language: Granule. With our extensions, the language is expressive 
enough to encompass a variety of programming approaches. Functional program- 
mers may freely mix and match from a variety of options for data management 
and manipulation. Object lifetimes may be either manually or automatically 
managed, and object contents may allow in-place mutation or be immutable. 


9 https://hackage.haskell.org/package/vector


5 Related Work


Uniqueness types are most well known for their appearance in the Clean language
[40, 47], where they are used in lieu of monadic computation and for
the efficiency gains offered by in-place update. In Clean, computation is based 
on graph rewriting and reduction; constants such as numbers are graphs, and 
functions are graph rewriting formulas. This gives the type system a rather dif- 
ferent feel to those offered by more recent functional programming languages, 
and makes it more difficult to capture the benefits of Clean-style uniqueness in 
a modern setting, hence the value in our pursuit of this goal. 

Some theoretical groundwork for Clean’s uniqueness types has certainly been
developed over the years, particularly in works by de Vries among others [55, 56];
these papers aim to clarify the distinction between Clean’s type system and
systems based on the $\lambda$-calculus. Further work makes headway on the problem of
distinguishing uniqueness from other substructural systems [53, 54]. This follows
a similar theoretical approach to the one demonstrated in our paper; such ideas 
for limited settings inspired the groundwork for our system, which is more general 
and has a practical implementation. 

Other languages (old and new) featuring uniqueness types include
Single-Assignment C [45], Mercury and Cogent [35].

Ownership was first developed as a framework for understanding aliasing 
in object-oriented languages [34], and is intended to give a high-level structural
view of objects and references in much the same way that powerful type systems 
give a high-level structural view of data. Ownership is now most familiar due 
to being pervasive in the Rust programming language, for which multiple for- 
malisations have been attempted; RustBelt gives a lower-level encoding of 
Rust intended for formal verification while Oxide is a higher-level encoding 
designed for more theoretical work, among others [39]. Extending these ideas to
other languages is an active area of research; RefinedC is one example. 

Regions have been used over the years in the context of effect systems 
[26]. One of the primary motivations of research into region types was their 
application in region-based memory management [51], which aimed to bring some
of the benefits of traditional stack-based memory management to higher-order 
functional languages. Regions divide values based on their lifetimes, so a system 
with region types can safely allocate and de-allocate memory for values based 
on region type information, eliminating the need for garbage collection. 

Early on, regions were restricted to have LIFO (last-in, first-out) lifetimes 
which followed the block structure of a language, but later work relaxed this 
constraint using uniqueness (see: static capabilities and Cyclone [21]); a
unique reference to a region ensures there are no aliases to the region, and that 
it can therefore be promptly de-allocated. Additionally, regions themselves act 
as a way to control aliasing, and can be thought of as equivalence classes for a 
“may alias” relation—in other words, values which do not share a region may not 
alias with one another, and so if a value does not share a region with anything 
else then it may be safely mutated in place.

Work on Cyclone demonstrated the relationship between regions and
unique pointers, observing that “unique pointers are essentially lightweight, dy- 
namic regions that hold exactly one object.” Beyond that, Rust’s lifetimes are 
heavily based on regions, and there exists an extension of ML called Affe 
which aims to support both linearity and borrowing using regions. 

Capabilities are tokens that a function must possess in order to be able to 
access a particular location in memory. Capabilities are linear, and cannot be 
duplicated or discarded, in order to prevent them from being forged [17].
Implementations exist for various object-oriented languages such as Java and
Scala [19]; more functional languages taking inspiration from the idea of
capabilities also exist [33, 41]. Recent work on linear constraints for Haskell [49], which
hopes to allow for something similar to borrowing within the framework of linear 
Haskell, also descends from work on capabilities. Ambient capabilities can also 
be internalised as a comonad to capture purity within an impure language [12]. 


6 Future Work 


Ownership via fractional permissions Though Granule can now represent 
values with both linear and unique types, the language allows for much more 
fine-grained analysis of resourceful data via grading. For instance, we can replay 
our earlier non-linearity example but with some extra information in the types: 


accurate : Cake [2] → (Happy, Cake)
accurate [cake] = let extra = have cake in (eat cake, extra)
Granule 
Instead of an infinite amount of cake we specify that we have exactly two cakes; 
the cake on the right-hand side must be linear as we only have one usage re- 
maining. If we used the input three times we would receive a type error. 

Given that we can move beyond the simple binary view of linear and non- 
linear, one might suspect that we could track the quantity of existing references 
to a value more accurately than just unique or non-unique. We propose taking 
inspiration here from Boyland’s notion of fractional permissions [8]. 

The purpose of fractional permissions is to allow multiple readers to access 
the same resource without losing the ability to later gain unique write access. A 
“permission” can be split up, allowing read-only access to multiple consumers, 
and then later recombined (while ensuring no other permissions still exist). 

To relate these with our calculus, let us hypothesise that $*_1 P$ is a ‘complete’
unique value that we can read from or write to, and that we can split up
arbitrarily into ‘fractionally’ unique values $*_n P$ where $0 < n \leq 1$, as follows:


$$*_n P \otimes *_m P \cong *_{n+m} P$$


As with fractional permissions, fractional values must only be used for behaviour 
that does not involve mutation, because whilst a value is only fractionally unique 
we cannot guarantee that other references do not exist. We should only regain 
this ability if we recombine the guarantees into a complete $*_1 P$.

This model closely resembles ownership as in Rust: we can think of a value
of type $*_n P$ for $n < 1$ as being equivalent to a Rust-style & P which is a borrowed
value that we cannot mutate.¹⁰ When a value has been borrowed the original
value cannot be written to until we are finished with the borrows, much like we
would need to collect all the fractionally unique values back together to get back
to our original unique $*_1 P$. Being able to more closely model Rust’s powerful
ownership system would make this a fruitful avenue for future research. 
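The split-and-recombine discipline of fractional permissions can be mimicked at runtime with exact fractions. The Python sketch below is a loose illustration and not a proposal for the calculus: the class `Permission` and its methods are hypothetical, and the checks happen dynamically, whereas the fractional types discussed above would reject the offending programs statically.

```python
from fractions import Fraction


class Permission:
    """A share of access to one mutable cell; a share of 1 is 'complete'."""

    def __init__(self, cell: list, share: Fraction):
        if not 0 < share <= 1:
            raise ValueError("share must lie in (0, 1]")
        self.cell, self.share = cell, share

    def split(self):
        half = self.share / 2
        return Permission(self.cell, half), Permission(self.cell, half)

    def join(self, other: "Permission") -> "Permission":
        if other.cell is not self.cell:
            raise ValueError("permissions refer to different cells")
        return Permission(self.cell, self.share + other.share)

    def read(self):
        return self.cell[0]           # any share may read

    def write(self, value) -> None:
        if self.share != 1:
            raise PermissionError("writing needs the complete permission")
        self.cell[0] = value


whole = Permission([41], Fraction(1))
left, right = whole.split()            # two read-only halves
try:
    left.write(0)
    write_rejected = False
except PermissionError:
    write_rejected = True
rejoined = left.join(right)            # 1/2 + 1/2 = 1
rejoined.write(42)
```

One limitation of the sketch is that Python keeps the old handles (`whole`, `left`, `right`) usable after splitting and joining; ruling that out is precisely what a linear or unique treatment of the permission values would provide.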


Linear Haskell Granule’s linear basis and assortment of modalities allows for a 
particularly natural embedding of the LCU calculus, but this does not preclude 
the theory of this paper from being applied in other contexts. One particularly 
valuable setting to consider would be Haskell, which as of GHC 9 already has
linear types based on an underlying graded system called $\lambda^q_\to$.

Haskell’s graded representation of linearity involves function types (a %r -> b)
which have a multiplicity annotation r; at present, this can be either 'One (linear)
or 'Many (unrestricted). But $\lambda^q_\to$ is designed to be extensible, and the possibility
of introducing additional multiplicities is welcomed.

The original paper on linear Haskell [7] mentions that “linear types are conceptually
simpler than uniqueness type systems, giving a clearer path to implementation
in GHC”, and also that “functional languages have more use for
fusion than in-place update”. Our clarification of the relationship between lin- 
earity and uniqueness demonstrates that not only are uniqueness types no more 
complex conceptually than linear ones, they can comfortably sit alongside one 
another in a single calculus; our evaluation demonstrates that while linearity is 
certainly useful, there are still further practical benefits to be gained from intro- 
ducing uniqueness into a language with linear types. Perhaps these contributions 
will begin to forge a path towards a future for Haskell where linear types and 
uniqueness types can both be leveraged for their respective strengths. 


Adjoint models Benton’s linear/non-linear (LNL) logic [6] consists of two
fragments: intuitionistic (non-linear) logic $\Theta \vdash_{\mathcal{C}} X$ and a mixed fragment of
intuitionistic linear logic with non-linear hypotheses $\Theta; \Gamma \vdash_{\mathcal{L}} A$. These two fragments
are connected by a pair of modalities Lin(X) and Mny(A), which form
an adjunction; the ! modality can then be recovered by !A = Lin(Mny(A)). 

Breaking the ! modality into two and allowing linear logic to be mixed with 
non-linear logic has been a valuable endeavour, and so a natural question is 
whether it is possible to build an LNL-style adjoint model for our unified LCU 
calculus. It seems plausible that building an adjoint model for just uniqueness 
logic would not be too difficult; this would be very similar to the LNL model but 
with the adjunction moving in the opposite direction, and the monadic modality 
o from uniqueness logic could be represented in much the same way that the 
comonadic ! can be recovered in LNL. 


10 Rust also includes mutable borrows, which allow the borrower to both read from and 
write to their borrowed reference. These are a much closer analogue to our current 
non-fractional calculus, since mutable borrows must be unique.

An adjoint model for the full LCU calculus would be more interesting. This
would most likely involve three fragments, two of which would be symmetric 
monoidal categories (for unique and linear values) and one of which would be a 
Cartesian closed category (for unrestricted values), with two adjunctions allow- 
ing values to flow from unique to unrestricted to linear as we might hope. 


Ordered and dependent types As expressive as Granule’s type system may 
be, there are opportunities for enforcing stronger properties on programs else- 
where in the landscape of type theories. One possibility is that in addition to 
restricting contraction and weakening, it is also possible to restrict exchange, 
giving ordered type theories which correspond to noncommutative logic. 

Such systems can be used to model stack-based memory allocation (as op- 
posed to heap-based), since without exchange an object may only be used when 
it is at the top of the stack [10, 62]. But much like linearity, these systems restrict
the use of exchange in the future; is there an equivalent of uniqueness for ordered 
types which guarantees that exchange has never been applied in the past, and 
could this be useful for tracking references on the stack? 

Another possibility is to bring uniqueness into the realm of dependent types. 
Recent work on graded modal dependent type theory (GRTT) allows for 
capturing requirements on variable usage at both the type and computation 
levels; grades come in pairs, where the first component is the computation-level 
grading and the second component is the type-level grading. Strictly linear usage 
in types is rare — is there value in being able to represent uniqueness here? 


7 Conclusion 


Linearity and uniqueness are both well-studied concepts with similar substruc- 
tural foundations, but differing benefits; linearity enables the careful manage- 
ment of resourceful data, while uniqueness offers the possibility of safe in-place 
updates. By formalising the relationship between these two ideas, building on 
two distinct bodies of literature, we have shown that there is value in having 
both linear and unique types in the same type system. This could be a first step 
on the road towards properly understanding the relationships between more ad- 
vanced substructural type systems, such as the fine-grained resource tracking of 
Granule and Idris and the complex memory management provided by Rust. 

Moreover, we implemented this system in the graded modal setting of the 
Granule language and provided benchmarks to demonstrate the efficiency gains 
that can be accessed via adding uniqueness to a language that already has a linear 
basis. The opportunities to incorporate uniqueness types into languages outside 
of Granule are apparent, and this paper offers both a theoretical underpinning for 
uniqueness as it relates to linearity as well as clear validation of the performance 
benefits that a system which unifies linearity and uniqueness can offer. 

Abstract. Mechanisation of programming language research is of grow- 
ing interest, and the act of mechanising type systems and their metathe- 
ory is generally becoming easier as new techniques are invented. However, 
state-of-the-art techniques mostly rely on structurality of the type system 
— that weakening, contraction, and exchange are admissible and vari- 
ables can be used unrestrictedly once assumed. Linear logic, and many 
related subsequent systems, provide motivations for breaking some of 
these assumptions. 

We present a framework for mechanising the metatheory of certain sub- 
structural type systems, in a style resembling mechanised metatheory of 
structural type systems. The framework covers a wide range of simply 
typed syntaxes with semiring usage annotations, via a metasyntax of 
typing rules. The metasyntax for the premises of a typing rule is related 
to bunched logic, featuring both sharing and separating conjunction, 
roughly corresponding to the additive and multiplicative features of lin- 
ear logic. We use the uniformity of syntaxes to derive type system-generic 
renaming, substitution, and a form of linearity checking. 


Keywords: Formalised syntax · substructural types · mechanised metatheory · 
quantitative typing 


1 Introduction 


In this paper, we treat the metatheory of a class of substructural type systems 
related to linear logic [11]. This class is variously known as coeffectful [17, 18], 
quantitative [4, 7], or resource-aware [10], or is given no particular name [1, 19], 
and generalises bounded linear logic to track variable usage with semiring 
annotations. In all of these systems, we have some ambient semiring ℛ, and in 
the judgements of the type system, variables are annotated by elements of ℛ 
describing how that variable can be used. The additive structure of ℛ gives the 
ability to count, or otherwise accumulate, usages of variables in multiple 
subterms. The multiplicative structure gives rise to a form of modality, for example 
allowing multiple or unlimited reuse, or movement between security levels, in 
the type system. The aspect of such systems we tackle here is their basic 
metatheory and mechanisation thereof. 

We build upon both the general structural framework of Allais et al. [3] 
and the substructural techniques of Wood and Atkey [21]. Just as Allais et al. 
consolidate and codify mechanisation techniques for propositional natural 
deduction systems based on intrinsically typed syntax and de Bruijn indices, we 
aim to do the same for linear-like systems based on semiring usage annotations. By 
picking a trivial semiring, our work can subsume that of Allais et al., except for 
the many pieces of machinery we have not yet ported to this new framework. 

Our work complements that of Orchard et al. [17] on the Granule program- 
ming language. Where Granule focuses on writing programs in the language and 
running them, we focus on metatheoretic reasoning about type systems. 

Our work is similar in scope to that of Licata et al. [13], though we work 
in a natural deduction style rather than a sequent calculus style. Where Licata 
et al. are much more agnostic in terms of substructurality — allowing for non- 
commutative and bunched logics — we are much more agnostic in terms of 
syntax. The system of Licata et al. is essentially a single calculus, supporting 
“product” (F) types and “function” (U) types, parametrised on a mode theory 
describing its structural rules. For this system, they derive the strong result of cut 
elimination. Meanwhile, we leave syntax design to the user, and consequently can 
only guarantee substitution (which we can only get because of our commitment 
to natural deduction). 

This paper proceeds as follows. In section 2, we review and fix conventions 
pertaining to partially ordered semirings and vectors over them. In section 3, we 
introduce an informal meta-syntax allowing us to state substructural typing rules 
succinctly and without explicit reference to contexts. In section 4, we mechanise 
that meta-syntax, giving a type of descriptions of type systems, and interpreting 
those descriptions as types of intrinsically typed terms. In section 5, we discuss 
usage-aware environments: a generalisation of the structures used in simulta- 
neous renaming and substitution proofs. We use environments in section 6 to 
state an alternative elimination principle for terms, and give examples of such 
eliminations in section 7. The examples are syntax-generic renaming and substi- 
tution, a specific denotational semantics, and a syntax-generic usage elaborator. 
Finally, we conclude and discuss future work in section 8. 

The work presented in this paper has been mechanised in Agda, with the 
code available for building upon [22]. 


2 Vectors over semirings 


The basic algebraic structure we deal with is partially ordered semirings, or 
posemirings for short. A posemiring is a (not necessarily commutative) semiring 
on a partially ordered set, where both operations are monotonic. As in many 
similar formalisms, posemiring elements represent usage restrictions, with 
addition collecting restrictions from multiple uses, multiplication handling usage 
under a modality, and the order giving subsumption of restrictions, comparable 
to subtyping. 


Definition 1. A posemiring is a tuple (ℛ, ≤, 0, +, 1, ∗) such that (ℛ, ≤) is a 
partially ordered set, (ℛ, 0, +) is a commutative monoid, (ℛ, 1, ∗) is a monoid, 
+ and ∗ are monotonic, and ∗ distributes over 0 and + on both sides. 


Example 1 (Zero-one-many). The poset {0 ≥ ω ≤ 1} forms a posemiring under 
normal numeric addition (with 1 + 1 = 1 + ω = ω + ω = ω) and multiplication 
(with ω ∗ ω = ω). This gives us a way to mark whether variables are unused 
(0), used linearly (1), or used unrestrictedly (ω) in the current (sub)term. The 
ordering says that unrestricted-use variables can also remain unused or be used 
linearly. 
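
As a sanity check on Example 1, the following Python sketch (not part of the paper's Agda development) encodes the zero-one-many carrier and checks the laws of Definition 1 by brute force. The string "w" stands for ω, and the function names add, mul, leq and is_posemiring are this sketch's own.

```python
# Zero-one-many posemiring from Example 1; "w" stands for omega.
ELEMS = ("0", "1", "w")

def add(a, b):
    """Numeric addition, saturating at w as soon as two non-zero uses meet."""
    if a == "0":
        return b
    if b == "0":
        return a
    return "w"  # 1+1 = 1+w = w+w = w

def mul(a, b):
    """Numeric multiplication with w*w = w; 0 annihilates and 1 is the unit."""
    if a == "0" or b == "0":
        return "0"
    if a == "1":
        return b
    if b == "1":
        return a
    return "w"

def leq(a, b):
    """The order 0 >= w <= 1: w lies below both 0 and 1."""
    return a == b or a == "w"

def is_posemiring():
    """Brute-force check of the posemiring laws on the three elements."""
    E = ELEMS
    units = all(add("0", a) == a and mul("1", a) == a == mul(a, "1") for a in E)
    comm = all(add(a, b) == add(b, a) for a in E for b in E)
    assoc = all(add(add(a, b), c) == add(a, add(b, c))
                and mul(mul(a, b), c) == mul(a, mul(b, c))
                for a in E for b in E for c in E)
    distrib = all(mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
                  and mul(add(a, b), c) == add(mul(a, c), mul(b, c))
                  for a in E for b in E for c in E)
    annihil = all(mul("0", a) == "0" == mul(a, "0") for a in E)
    mono = all(leq(add(a, c), add(b, c)) and leq(mul(a, c), mul(b, c))
               and leq(mul(c, a), mul(c, b))
               for a in E for b in E for c in E if leq(a, b))
    return units and comm and assoc and distrib and annihil and mono
```

The check is exhaustive only because the carrier is finite; it says nothing about other posemirings.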


Example 2 (Variance). The set {∼∼, ↑↑, ↓↓, ??}, with ∼∼ at the bottom and 
?? at the top of the order, forms a posemiring with addition being meet, 0 being 
top (??), 1 being ↑↑, and multiplication being commutative and determined 
by ↓↓ ∗ ↓↓ = ↑↑ and ∼∼ ∗ ↓↓ = ∼∼ ∗ ∼∼ = ∼∼. This gives us a way to 
track the variance with which variables are used, with the aim of all terms being 
monotonic in their free variables. ↑↑ stands for covariance, ↓↓ for contravariance, 
∼∼ for invariance, and ?? for a variable with no guarantees, in which we must 
be constant. 
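
The variance posemiring can be tabulated in the same way. The sketch below is in Python rather than the paper's Agda, and it assumes the reading of the example in which the two middle elements are covariance and contravariance and are incomparable. The names "inv", "co", "contra" and "unk" are hypothetical labels for invariance, covariance, contravariance and the no-guarantees element ??.

```python
# Variance posemiring of Example 2, with hypothetical element names:
# "inv" is the bottom, "unk" (for ??) is the top, "co" and "contra" sit between.
ELEMS = ("inv", "co", "contra", "unk")

def leq(a, b):
    """Diamond order: inv below everything, unk above everything."""
    return a == b or a == "inv" or b == "unk"

def add(a, b):
    """Addition is meet (greatest lower bound); its unit is the top, unk."""
    if leq(a, b):
        return a
    if leq(b, a):
        return b
    return "inv"  # co and contra are incomparable, so their meet is the bottom

def mul(a, b):
    """Commutative multiplication: unk is 0, co is 1, contra * contra = co."""
    if a == "unk" or b == "unk":
        return "unk"
    if a == "co":
        return b
    if b == "co":
        return a
    if a == "contra" and b == "contra":
        return "co"
    return "inv"  # every remaining product involves inv
```

For instance, mul("contra", "contra") returns "co", matching the usual rule that two contravariant positions compose to a covariant one.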


An element of a chosen posemiring ℛ describes the usage restrictions on a 
variable. Therefore, a vector of elements from ℛ describes the usage restrictions 
of a whole context’s worth of variables. From the posemiring operations of ℛ, 
we derive the standard vector operations of zero, addition, and multiplication 
by a scalar. We can also form the standard basis vectors at any given dimension. 
From the order on ℛ, we get a pointwise order on vectors. 

Vectors of a given length form a module over the posemiring ℛ, analogously 
to how vectors over a field form a vector space. The partial order on such vectors 
is pointwise. 
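
A minimal Python sketch of these vector operations follows. For simplicity it assumes that the posemiring is the natural numbers with the usual addition and multiplication. The helper names zero, basis, vadd, scale and vleq are this sketch's own, and the order is passed in as a parameter because the paper's examples vary it.

```python
def zero(n):
    """The zero vector of length n."""
    return [0] * n

def basis(n, i):
    """The i-th standard basis vector of length n (1 at position i, 0 elsewhere)."""
    return [1 if j == i else 0 for j in range(n)]

def vadd(u, v):
    """Pointwise addition of two usage vectors of the same length."""
    return [a + b for a, b in zip(u, v)]

def scale(r, u):
    """Scalar multiplication r . u, with the scalar acting on the left."""
    return [r * a for a in u]

def vleq(u, v, leq=lambda a, b: a <= b):
    """Pointwise order on vectors, induced by the given order on annotations."""
    return len(u) == len(v) and all(leq(a, b) for a, b in zip(u, v))
```

With these definitions, the module law (r + s) · u = r · u + s · u can be spot-checked as scale(2 + 3, u) == vadd(scale(2, u), scale(3, u)) for any list u of naturals.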


Definition 2. A (left) module over a posemiring ℛ is a 
partially ordered commutative monoid (M, 0_M, +_M) with, for each r ∈ ℛ, a 
pomonoid morphism r · (−) : M → M, such that the collection of these respects 
the posemiring structure on r. Specifically, for all instantiations of the variables: 


– If r ≤ r′ and u ≤ u′, then r · u ≤ r′ · u′. 
– r · 0_M = 0_M and r · (u +_M v) = r · u +_M r · v. 
– 0 · u = 0_M and (r + s) · u = r · u +_M s · u. 
– 1 · u = u and (r ∗ s) · u = r · (s · u). 


We define modules in order to define module morphisms, also known 
as linear maps, which we use extensively when relating two contexts (as we 
do, for example, in simultaneous substitution). For the sake of mechanisation, 
we choose to define module morphisms relationally rather than functionally, 
giving a somewhat unfamiliar-looking definition that is equivalent to the usual 
functional definition. The main advantage of this relational approach is that 
proofs of relatedness for typical linear maps compose and decompose via data 
constructors and pattern matching. 


Definition 3. A (relational) linear map Ψ between modules M and N over a 
posemiring ℛ is a relation ∼ on the underlying sets of M and N satisfying the 
following laws (with → standing for implication and quantifiers binding most 
loosely). 


– ∀u, u′, v, v′. u ≤ u′ → v′ ≤ v → u ∼ v → u′ ∼ v′ 
– ∀v. (∃u. u ≤ 0 ∧ u ∼ v) → v ≤ 0 
– ∀u₀, u₁, v. (∃u. u ≤ u₀ + u₁ ∧ u ∼ v) → 
  (∃v₀, v₁. u₀ ∼ v₀ ∧ u₁ ∼ v₁ ∧ v ≤ v₀ + v₁) 
– ∀r, u′, v. (∃u. u ≤ r · u′ ∧ u ∼ v) → (∃v′. u′ ∼ v′ ∧ v ≤ r · v′) 
– ∀u. ∃v. u ∼ v ∧ ∀v′. u ∼ v′ → v′ ≤ v 


Intuitively, Q ∼ P, where P and Q are row vectors, is equivalent to P ≤ QΨ, 
where Ψ is the matrix representing the linear map and on the right is a 
vector-matrix multiplication. It is important that we think of row vectors and 
right-multiplication by a matrix because, without commutativity of the underlying 
posemiring, we can only expect (rQ)Ψ = r(QΨ) and not Ψ(rQ) = r(ΨQ). In 
section 5, we use the matrix notation for convenience, while in the Agda code 
we see Ψ rel P Q. 
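
The matrix reading of a relational linear map can be sketched in Python as follows. This is an illustration rather than the paper's Agda definition. It assumes the natural numbers with the usual order as the posemiring, and the matrix PSI and the names vec_mat and related are hypothetical.

```python
def vec_mat(q, psi):
    """Row vector times matrix: (q Psi)[j] = sum over i of q[i] * psi[i][j]."""
    cols = len(psi[0])
    return [sum(q[i] * psi[i][j] for i in range(len(q))) for j in range(cols)]

def related(q, p, psi):
    """The relation Q ~ P induced by psi, read as P <= Q Psi pointwise."""
    return all(a <= b for a, b in zip(p, vec_mat(q, psi)))

# A hypothetical 2x2 matrix over the naturals, used only for illustration.
PSI = [[1, 0],
       [1, 2]]
```

Under this reading, vec_mat(q, PSI) is the largest vector related to q, which is what the last law of the definition asks for, and additivity of vec_mat supplies the witnesses for the law about sums.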


3 Bunched Typing Rules 


We now let ℛ be an arbitrary posemiring. Our framework represents well typed 
and ℛ-usaged terms intrinsically. Intrinsic typing means that we represent well 
typed and ℛ-usaged terms (and only those) as inhabitants of an inductive family 
Rγ ⊢ A indexed by usage context R, type context γ, and type A. We represent 
the shape of a context as a nullary-binary tree, with typing and usage contexts 
being functions that assign types and elements of ℛ, respectively, to leaves of 
the tree. Using trees instead of lists for typing contexts has the advantage that 
extension of a context by multiple variables does not lead to complex counting 
arguments to access the pre-existing variables, because context extension is 
(judgementally) injective. However, these precise details will eventually become 
irrelevant, as we will be able to use simultaneous renaming to smooth over any 
structural differences between contexts. 

Figure 1 presents a prototypical example of a system that our framework 
can represent, which is a subsystem of the λR system of Wood and Atkey [21]. 
Each rule is given as a constructor: the premises are named p, s, t, etc., and 
the conclusion is a constructor applied to those metalanguage variables. Object 
language variables are represented intrinsically as members of the type Rγ ∋ A, 
which is a proof that the type A appears in the typing context, i : γ ∋ A, together 
with a proof that R ≤ ⟨i|. Expanding the vector notation, the latter condition 
says that the selected variable i must have a usage annotation ≤ 1 in R, while 
all other variables must have a usage annotation ≤ 0. We use the constructors 
↙ and ↘ to describe a path down the nullary-binary tree, terminated by the 
word here. The var rule imports variables into terms. 

\[
\dfrac{x : R\gamma \ni A}{\mathrm{var}\ x : R\gamma \vdash A}
\qquad
\dfrac{t : R\gamma, 1A \vdash B}{{\multimap}I\ t : R\gamma \vdash A \multimap B}
\qquad
\dfrac{p : R \le P + Q \quad s : P\gamma \vdash A \multimap B \quad t : Q\gamma \vdash A}
      {{\multimap}E\ p\ s\ t : R\gamma \vdash B}
\]
\[
\dfrac{t : R\gamma \vdash A_i}{{\oplus}I_i\ t : R\gamma \vdash A_0 \oplus A_1}
\qquad
\dfrac{p : R \le P + Q \quad s : P\gamma \vdash A \oplus B \quad t : Q\gamma, 1A \vdash C \quad u : Q\gamma, 1B \vdash C}
      {{\oplus}E\ p\ s\ t\ u : R\gamma \vdash C}
\]
\[
\dfrac{p : R \le rP \quad t : P\gamma \vdash A}{{!}I\ p\ t : R\gamma \vdash {!}_r A}
\qquad
\dfrac{p : R \le P + Q \quad s : P\gamma \vdash {!}_r A \quad t : Q\gamma, rA \vdash C}
      {{!}E\ p\ s\ t : R\gamma \vdash C}
\]

Fig. 1. A prototypical posemiring-usaged system 

The remaining rules are the introduction and elimination rules for three type 
constructors: ⊸I and ⊸E for function types A ⊸ B where the bound variable 
is annotated with 1 for “use once”; ⊕Iᵢ and ⊕E for sum types A ⊕ B; and !I and 
!E for an ℛ-annotated exponential modality !ᵣA. 

There are two key observations to make about this system, which will guide 
the way we build our generic framework for ℛ-annotated substructural systems: 


1. Every rule repeats the typing context γ throughout its premises and 
   conclusion. The only time the typing context is modified is to add additional 
   variables in the rules that bind fresh variables (⊸I, ⊕E, !E). 

2. Rules with multiple typing premises must describe how the usages of the 
   conclusion (always denoted R) are distributed across the premises. In the 
   ⊸E rule, the usages are separated into two parts P and Q for the premises. 
   This is an example of a multiplicative rule in the terminology of Linear Logic 
   [11]. In the ⊕E rule we see an example of an additive rule, where the usage 
   context Q is shared between the premises t and u. The !I rule uses scaling 
   by r of the usages of the premise. 


1 There is an unfortunate clash of terminology here: multiplicative rules add their 
usage contexts, while additive rules share their usage contexts. 

These observations indicate a way to regularise and streamline the 
presentation of this system. Instead of treating each premise and the conclusion as having 
potentially unrelated typing and usage constraints, we make use of combinators 
for combining premises that will relate their usage and typing contexts to the 
conclusion by construction. This idea comes from the work of Rouvoet et al. 
[20], including the ✴ and ─✴ connectives we use later. To handle binders, which 
introduce variables, we make use of a combinator that adds a variable with a 
given ℛ-annotation to an ambient context, without having to explicitly mention 
the parts of the context that have not changed. This technique is already present 
in some paper presentations of type systems, and is formalised by Allais et al. 
[3]. To manage how usage annotations are distributed between premises, we use 
the separating (✴) and sharing (×̇) conjunction connectives from Bunched 
Implications [16]. To handle the !I rule, we will need a scaling modality, r · −. The 
semantics of the bunched connectives we will use in this paper are: 


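\[
\begin{aligned}
\dot{1}\; R &:= 1 \\
(T \mathbin{\dot\times} U)\; R &:= T\,R \times U\,R \\
(T \mathbin{\dot\to} U)\; R &:= T\,R \to U\,R \\
I^{✴}\; R &:= R \le 0 \\
(T \mathbin{✴} U)\; R &:= \Sigma P, Q.\ (R \le P + Q) \times T\,P \times U\,Q \\
(T \mathbin{─✴} U)\; P &:= \Pi Q, R.\ (R \le P + Q) \to T\,Q \to U\,R \\
(r \cdot T)\; R &:= \Sigma P.\ (R \le rP) \times T\,P.
\end{aligned}
\]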

The function connectives →̇ and ─✴ are not used in typing rules, but are used 
in the rest of the framework (though one can interpret the horizontal line in a 
typing rule as →̇ plus universal quantification). An important point to note is 
that bunched combinators induce linear combinations of substructures, in the 
sense of the linear algebra of posemirings described in the previous section. 
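
To make the difference between separating and sharing conjunction concrete, here is a brute-force Python sketch over the zero-one-many posemiring. It is an illustration under stated assumptions, not the framework's Agda definitions: predicates are plain Python functions on tuples of annotations, "w" stands for ω, and the names var, sep and share are this sketch's own. The predicate var(i) encodes the condition R ≤ ⟨i|.

```python
from itertools import product

ANN = ("0", "1", "w")  # zero-one-many annotations; "w" stands for omega

def add(a, b):
    return b if a == "0" else a if b == "0" else "w"

def leq(a, b):
    return a == b or a == "w"  # w lies below both 0 and 1

def vleq(r, s):
    return all(leq(a, b) for a, b in zip(r, s))

def vadd(p, q):
    return tuple(add(a, b) for a, b in zip(p, q))

def var(i):
    """Predicate 'R <= <i|': annotation <= 1 at position i and <= 0 elsewhere."""
    return lambda r: all(leq(a, "1" if j == i else "0") for j, a in enumerate(r))

def sep(t, u):
    """Separating conjunction: some split R <= P + Q with T at P and U at Q."""
    def pred(r):
        vecs = list(product(ANN, repeat=len(r)))
        return any(vleq(r, vadd(p, q)) and t(p) and u(q)
                   for p in vecs for q in vecs)
    return pred

def share(t, u):
    """Sharing conjunction: both predicates hold at the same usage vector."""
    return lambda r: t(r) and u(r)
```

In this sketch, sep(var(0), var(1)) accepts ("1", "1") while share(var(0), var(1)) rejects it, and sep(var(0), var(0)) rejects ("1", "0") because a variable annotated 1 cannot be split between two uses.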


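\[
\dfrac{x : {\ni} A}{\mathrm{var}\ x : {\vdash} A}
\qquad
\dfrac{t : 1A \vdash B}{{\multimap}I\ t : {\vdash} A \multimap B}
\qquad
\dfrac{(t : {\vdash} A \multimap B) \mathbin{✴} (s : {\vdash} A)}{{\multimap}E\ t\ s : {\vdash} B}
\]
\[
\dfrac{t : {\vdash} A_i}{{\oplus}I_i\ t : {\vdash} A_0 \oplus A_1}
\qquad
\dfrac{(s : {\vdash} A \oplus B) \mathbin{✴} ((t : 1A \vdash C) \mathbin{\dot\times} (u : 1B \vdash C))}
      {{\oplus}E\ s\ t\ u : {\vdash} C}
\]
\[
\dfrac{t : r \cdot ({\vdash} A)}{{!}I\ t : {\vdash} {!}_r A}
\qquad
\dfrac{(s : {\vdash} {!}_r A) \mathbin{✴} (t : rA \vdash C)}{{!}E\ s\ t : {\vdash} C}
\]

Fig. 2. The prototypical system of figure 1 restated in terms of bunched combinators. 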


Figure 2 shows our prototypical system restated with implicit contexts and 
the bunched combinators. The inductive family is now denoted ⊢ A, only 
mentioning context extensions, as we do in the rules ⊸I, ⊕E, and !E. Thus, in the 
var rule, the context is completely suppressed. The ⊸I rule just has to state 
that a new variable with usage annotation 1 and type A is added to the 
context. The ⊸E rule uses the separating conjunction (✴) to combine the premises, 
indicating that the usages of the two premises are added together for the 
conclusion. The ⊕E rule demonstrates the sharing conjunction ×̇: the scrutinee term s 
and the clause terms t, u are combined by separating conjunction, because their 
usages must be combined, but the clause terms are combined by the sharing 
conjunction, because they have the same usage context. 

Bunched combinators, along with suppression of unchanged typing contexts, 
lead to a more streamlined presentation of the system without the clutter of 
explicit usage context inequalities. However, the larger advantage for us is that 
systems constructed using these combinators automatically admit renaming, 
substitution, and other scope-, type-, and usage-safe traversals. If we were to 
allow arbitrary modification of the context in premises, these results would not 
be possible, since there would be no guarantee that a substitution (for instance) 
could be “pushed” up from a conclusion to the premises. As we will see in 
section 5, our generic notion of environment (e.g., a simultaneous substitution) 
is based around linear transformations, and so automatically commutes with the 
linear combinations of premises induced by the bunched connectives. This is the 
key to our generic results for all of the systems describable in our framework. 


4 Generic syntax 


We take the insights of the previous section and use them to build a generic 
framework for posemiring-annotated substructural systems in Agda. We will first 
show descriptions of systems, which are comprised of rules that have premises 
combined using the bunched combinators. We then show how to construct the 
Agda data type of intrinsically well scoped, typed, and resourced terms for any 
given system of our framework. We use the prototypical system from figure 2 as 
a running example. Section 4.3 presents further examples that our framework 
can express. 

We now start to use Agda notation for record and data type declarations, to 
emphasise that our framework has been implemented. 


4.1 Descriptions of Systems 


A type System is made up of multiple Rules. Each Rule comprises a Premises and 
a conclusion type. We assume that there is a Ty : Set of types for the system in 
scope. 

The Premises data type describes premises of rules, using the bunched 
combinators from section 3. A single premise is introduced by the ⟨_`⊢_⟩ constructor. 
This allows binding of additional variables Δ (with specified types and usage 
annotations) and the specification of a conclusion type A for this premise. The 
remaining constructors are descriptions for the bunched connectives. 


data Premises : Set where 
  ⟨_`⊢_⟩ : (Δ : Ctx) (A : Ty) → Premises 
  `1̇ : Premises;   _`×̇_ : (p q : Premises) → Premises 
  `I✴ : Premises;  _`✴_ : (p q : Premises) → Premises 
  _`·_ : (r : Ann) (p : Premises) → Premises 

A Rule is a pair of some Premises and a conclusion. We use an infix arrow as 
a suggestive notation for rules. 


record Rule : Set where 
  constructor _=⇒_ 
  field premises : Premises; conclusion : Ty 

Finally, a System consists of a set of rule labels (i.e., constructor names), and 
for each label a description of the corresponding rule. We use ▹ as infix notation 
for systems to associate the label set with the rules. 


record System : Set₁ where 
  constructor _▹_ 
  field Label : Set; rules : (l : Label) → Rule 


As an example, we transcribe the system defined in figure 2 into a 
description. We give the set of types of this system as a data type Ty (together with a 
base type ι). We assume that there is a posemiring Ann in scope for the 
annotations. There is one label for each instantiation of a logical rule, but the labels 
contain no further information about subterms or restrictions on the context. 
This will be provided when we associate labels with Rules in a System. 


data Ty : Set where 
  ι : Ty 
  _⊸_ _⊕_ : (A B : Ty) → Ty 
  ! : (r : Ann) (A : Ty) → Ty 

data Side : Set where ll rr : Side 

data `λR : Set where 
  `⊸I `⊸E : (A B : Ty) → `λR 
  `⊕I : (i : Side) (A B : Ty) → `λR 
  `⊕E : (A B C : Ty) → `λR 
  `!I : (r : Ann) (A : Ty) → `λR 
  `!E : (r : Ann) (A C : Ty) → `λR 


To build a system, we associate with each label a rule: 


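λR : System 
λR = `λR ▹ λ where 
  (`⊸I A B)    → ⟨ [ 1# , A ]ᶜ `⊢ B ⟩ =⇒ (A ⊸ B) 
  (`⊸E A B)    → (⟨ []ᶜ `⊢ A ⊸ B ⟩ `✴ ⟨ []ᶜ `⊢ A ⟩) =⇒ B 
  (`!I r A)    → (r `· ⟨ []ᶜ `⊢ A ⟩) =⇒ (! r A) 
  (`!E r A C)  → (⟨ []ᶜ `⊢ ! r A ⟩ `✴ ⟨ [ r , A ]ᶜ `⊢ C ⟩) =⇒ C 
  (`⊕I ll A B) → ⟨ []ᶜ `⊢ A ⟩ =⇒ (A ⊕ B) 
  (`⊕I rr A B) → ⟨ []ᶜ `⊢ B ⟩ =⇒ (A ⊕ B) 
  (`⊕E A B C)  → 
    (⟨ []ᶜ `⊢ A ⊕ B ⟩ `✴ (⟨ [ 1# , A ]ᶜ `⊢ C ⟩ `×̇ ⟨ [ 1# , B ]ᶜ `⊢ C ⟩)) =⇒ C 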


Compared to figure 2, modulo the Agda notation, we can see that the fun- 
damental structure has been preserved: the rules match one-to-one, and the 
bunched premises are the same. A major difference is that we do not include a 
counterpart to the var rule in a System. Variables are common to all the systems 
representable in our framework. 


4.2 Terms of a System 


The next thing we want to do is to build terms in the described type system. 
The following definitions are useful for talking about types indexed over con- 
texts, judgement forms, and judgement forms admitting newly bound variables, 
respectively. 

OpenType : ∀ ℓ → Set (suc ℓ) 
OpenType ℓ = Ctx → Set ℓ 


OpenFam : ∀ ℓ → Set (suc ℓ) 
OpenFam ℓ = Ctx → Ty → Set ℓ 


ExtOpenFam : ∀ ℓ → Set (suc ℓ) 
ExtOpenFam ℓ = Ctx → OpenFam ℓ 


To specify the meaning of descriptions, we assume some X : ExtOpenFam, 
over which we form one layer of syntax, using the function ⟦_⟧p that interprets 
Premises defined below. The first argument to X is the new variables bound 
by this layer of syntax, as exemplified in the first clause of ⟦_⟧p. The second 
argument is the context containing the variables being carried over from the 
previous layer. Notice that this is not, in general, the same as the context from 
the previous layer, because the usage annotations may have been changed by 
connectives like _`✴_ and _`·_. The third argument is the type of subterm required. 

The remainder of the clauses of ⟦_⟧p are the interpretation into bunched 
combinators. The superscript ᶜ on the bunched connectives denotes that they 
have been lifted from predicates on usage vectors to predicates on contexts, with 
the type component of the context shared throughout. Additive connectives 1̇ 
and ×̇ are already polymorphic (not relying on anything specific about usage 
vectors), so do not need a ᶜ variant. 


⟦_⟧p : Premises → ExtOpenFam ℓ → OpenType ℓ 
⟦ ⟨ Δ `⊢ A ⟩ ⟧p X Γ = X Δ Γ A 
⟦ `1̇ ⟧p X = 1̇;      ⟦ p `×̇ q ⟧p X = ⟦ p ⟧p X ×̇ ⟦ q ⟧p X 
⟦ `I✴ ⟧p X = I✴ᶜ;   ⟦ p `✴ q ⟧p X = ⟦ p ⟧p X ✴ᶜ ⟦ q ⟧p X 
⟦ r `· p ⟧p X = r ·ᶜ ⟦ p ⟧p X 


The interpretation of a Rule checks that the rule targets the desired type 
and then interprets the rule’s premises ps. Notice that the interpretation of the 
premises is independent of the conclusion of the rule, which accounts for the use 
of OpenType in ⟦_⟧p versus OpenFam in ⟦_⟧r. 


⟦_⟧r : Rule → ExtOpenFam ℓ → OpenFam ℓ 
⟦ ps =⇒ A′ ⟧r X Γ A = A′ ≡ A × ⟦ ps ⟧p X Γ 


The interpretation of a System is to choose a rule label l from L and interpret 
the corresponding rule rs l in the same context and for the same conclusion. 


⟦_⟧s : System → ExtOpenFam ℓ → OpenFam ℓ 
⟦ L ▹ rs ⟧s X Γ A = Σ[ l ∈ L ] ⟦ rs l ⟧r X Γ A 


The most obvious way to make such an X is to use some existing OpenFam 
on an extended context. We define Scope to do this: take the new variables Δ, 
concatenate them onto the existing context Γ, and pass the extended context 
on to the judgement T. 

Scope : ∀ {ℓ} → OpenFam ℓ → ExtOpenFam ℓ 
Scope T Δ Γ A = T (Γ ++ᶜ Δ) A 


We use Scope to deal with new variables in syntax. Terms resemble the free 
monad over a layer-of-syntax functor, though that picture is complicated by 
variable binding. A term is either a variable or a use of a logical rule together 
with terms for each of the required subterms. The Size argument is a use of 
Agda’s sized types to record that subterms are smaller than the surrounding 
term for the termination checker. 


data [_,_]_⊢_ (d : System) : Size → OpenFam 0ℓ where 
  `var : ∀[ _∋_ →̇ [ d , ↑ sz ]_⊢_ ] 
  `con : ∀[ ⟦ d ⟧s (Scope [ d , sz ]_⊢_) →̇ [ d , ↑ sz ]_⊢_ ] 


This definition uses →̇, which, analogously to ×̇, is an index-preserving 
version of the function space. We take →̇ to handle n many indices, in this case 
two (the context and the type). The notation ∀[ T ] stands for ∀ {x₁ ... xₙ} → 
T x₁ ... xₙ, where T is a type family with n many indices. 

Terms in this data type are difficult to write by hand, due to the need for 
proofs that the usage contexts are handled correctly. For example, the following 
term is needed to show that, in the {0, 1, ω} (linearity) posemiring of example 1, 
!ω forms a comonad. Pattern synonyms ⊸I, !E′, and !I′ stand for applications of 
`con, with the latter two taking explicit usage contexts and proofs. On concrete 
posemirings (as in this example), unification is particularly poor at inferring the 
usage contexts from the proofs because addition and multiplication are no longer 
(judgementally) injective. The function var# is a way of turning a statically 
known de Bruijn level and a usage proof into an application of `var. 


cojoin-!ω : ∀ A → [ λR , ∞ ] []ᶜ ⊢ (! ω# A ⊸ ! ω# (! ω# A)) 
cojoin-!ω A = 
  ⊸I (!E′ ([] ++ [ 1# ]) ([] ++ [ 0# ]) ([]ₙ ++ₙ [ ≤-refl ]ₙ) 
    (var# 0 (([]ₙ ++ₙ [ ≤-refl ]ₙ) ++ₙ []ₙ)) 
    (!I′ (([] ++ [ 0# ]) ++ [ ω# ]) 
      (([]ₙ ++ₙ [ ≤-refl ]ₙ) ++ₙ [ ≤-refl ]ₙ) 
      (!I′ ((([] ++ [ 0# ]) ++ [ ω# ]) ++ []) 
        ((([]ₙ ++ₙ [ ≤-refl ]ₙ) ++ₙ [ ≤-refl ]ₙ) ++ₙ []ₙ) 
        (var# 1 (((([]ₙ ++ₙ [ ≤-refl ]ₙ) ++ₙ [ ω≤1 ]ₙ) ++ₙ []ₙ) ++ₙ []ₙ))))) 


Writing terms like this is clearly unsustainable. We will see a way of automat- 
ing the necessary proofs via a System-generic elaborator in section 7.2. 


4.3 Other syntaxes and syntactic forms 


The system μμ̃. We can encode a usage-annotated version of System L/the 
μμ̃-calculus [8] (a syntax for classical logic) in such a way that contexts capture 
the undistinguished parts of the sequent. As such, the generic substitution lemma 
we get in section 7.1 is the form of substitution required in standard μμ̃-calculus 
metatheory. Though the μμ̃-calculus is originally described as a sequent 
calculus [8], we use the techniques of Herbelin [12, p. 12] and Lovas and Crary [14] 
to present it as a natural deduction system, thus giving a notion of variable to 
the system. 

Unlike the single judgement form of λR and standard simply typed λ-calculi, 
the μμ̃-calculus has three judgement forms: terms, coterms, and commands. 
Read logically, terms and coterms are seen to, respectively, prove and refute 
propositions (types), while commands exhibit contradictions. This means that 
the abstract Ty in the generic framework is instantiated to Conc (for conclusion) 
as below, with Ty not being exposed directly to the generic framework. For now, 
we just consider multiplicative disjunction ⅋ (par) and negation/duality, beside 
an uninterpreted base type. These are enough to exhibit classical behaviour. 


data Ty : Set where 
  base : Ty 
  _⅋_ : (rA sB : Ann × Ty) → Ty 
  _^⊥ : (A : Ty) → Ty 

data Conc : Set where 
  com : Conc 
  trm cot : (A : Ty) → Conc 


With Ty instantiated as Conc, all terms are assigned Conc type, as are all the 
variables. No variables are given com type, similar to how in the bidirectional 
typing syntax of Allais et al. [3, p. 25], no variables are given Check type. How 
to observe this invariant is covered in the latter paper, so we will not repeat it 
here (having not yet seen how to write traversals on terms). 

The syntax comprises a cut between a term and a coterm of the same type, 
the eponymous μ and μ̃ constructs for proof by contradiction, and then term 
and coterm (introduction and elimination) forms for negation and par. 


data `MMT : Set where 
  `cut `μ `μ∼ : (A : Ty) → `MMT 
  `λ `λ∼ : (A : Ty) → `MMT 
  `⟨-,-⟩ `μ⟨-,-⟩ : (rA sB : Ann × Ty) → `MMT 


MMT : System 
MMT = `MMT ▹ λ where 
  (`cut A) → ⟨ []ᶜ `⊢ trm A ⟩ `✴ ⟨ []ᶜ `⊢ cot A ⟩ =⇒ com 
  (`μ A)   → ⟨ [ 1# , cot A ]ᶜ `⊢ com ⟩ =⇒ trm A 
  (`μ∼ A)  → ⟨ [ 1# , trm A ]ᶜ `⊢ com ⟩ =⇒ cot A 
  (`λ A)   → ⟨ []ᶜ `⊢ cot A ⟩ =⇒ trm (A ^⊥) 
  (`λ∼ A)  → ⟨ []ᶜ `⊢ trm A ⟩ =⇒ cot (A ^⊥) 
  (`⟨-,-⟩ rA@(r , A) sB@(s , B)) → 
    r `· ⟨ []ᶜ `⊢ cot A ⟩ `✴ s `· ⟨ []ᶜ `⊢ cot B ⟩ =⇒ cot (rA ⅋ sB) 
  (`μ⟨-,-⟩ rA@(r , A) sB@(s , B)) → 
    ⟨ [ r , cot A ]ᶜ ++ᶜ [ s , cot B ]ᶜ `⊢ com ⟩ =⇒ trm (rA ⅋ sB) 

Duplicability. There is one more bunched combinator we have experimented with 
adding to the framework: 


\[
(\Box T)\; R := \Sigma R'.\ (R \le R') \times (R' \le 0) \times (R' \le R' + R') \times T\,R'
\]


The idea of (□T) R is to assert that R, or some refinement of it, can be both 
discarded and duplicated indefinitely, and in the refinement we have a T. We 
use this combinator to introduce subterms that are used an unknown number 
of times, for example the continuations of the eliminator of an inductive type, 
or other fixed points. We can also use it in linear/non-linear style systems [6] to 
make sure linear variables are not available in the intuitionistic fragment. 

Adding the □ combinator is the only thing we have found that requires our 
linear maps be functional rather than merely relational. 
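
The effect of this combinator on single annotations of the zero-one-many posemiring can be explored with a small Python sketch. The sketch assumes that the first condition reads R ≤ R′, meaning that R can be weakened to the refinement R′, which matches the shape of the other connectives. The names box and anything are hypothetical, and "w" stands for ω.

```python
ANN = ("0", "1", "w")  # zero-one-many annotations; "w" stands for omega

def add(a, b):
    return b if a == "0" else a if b == "0" else "w"

def leq(a, b):
    return a == b or a == "w"  # w lies below both 0 and 1

def box(t):
    """Duplicability on one annotation: some refinement r2 of r can be
    discarded (r2 <= 0) and duplicated (r2 <= r2 + r2), and t holds at r2."""
    def pred(r):
        return any(leq(r, r2) and leq(r2, "0") and leq(r2, add(r2, r2)) and t(r2)
                   for r2 in ANN)
    return pred

anything = lambda r: True  # trivial predicate, to isolate the side conditions
```

Under that assumption, box(anything) holds at "0" and "w" but fails at "1", which is the sense in which a linear variable is unavailable inside the duplicable fragment.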


5 Environments 


We have now seen how to build data types of intrinsically well typed and well 
usaged terms for a given System. In the next section, we will define a generic 
traversal function that assigns a “semantics” to each term. Traversals operate 
on open terms, so they need a way to assign semantics to variables in a typed 
and usage respecting manner. This is the function fulfilled by environments. 
Given a semantic notion of variable V : OpenFam, we use the notation Γ ⊨^V A, 
meaning V Γ A, for the type of inhabitants of V in the context Γ at type A. In 
the non-substructural systems of Allais et al. [3], a V-environment Γ ⇒^V Δ is 
nothing more than a function ∀A → Δ ∋ A → Γ ⊨^V A, mapping variables to 
V-things. In our usage annotated setting though, we must correctly distribute 
resources tracked by the annotations, making sure that we have enough resources 
in Γ to cover all the demands in Δ. Following our previous work [21], this 
accounting is expressed via the presence of a linear transformation: 


Definition 4 (Environment). A V-environment between annotated contexts 
Γ and Δ (decomposed as Pγ and Qδ, respectively, when convenient) is a linear 
map Ψ : ℛ^|δ| → ℛ^|γ| (written postfix) such that P ≤ QΨ and, for each A, P′, 
and Q′ such that P′ ≤ Q′Ψ, a “lookup” function from Q′δ ∋ A to P′γ ⊨^V A. 


In Agda code, we use [ V ] Γ ⊨ A instead of Γ ⊨^V A and [ V ] Γ ⇒ᵉ Δ instead 
of Γ ⇒^V Δ. 

The specification of the lookup function has some redundancy. Notice that, 
for Q′δ ∋ A to hold, we must have Q′ ≤ ⟨i| for some i. Instead of P′ ≤ 
Q′Ψ, asking for P′ ≤ ⟨i|Ψ would be just as general. Additionally, all of the 
Vs we consider satisfy the subusaging property (that P′ ≤ P yields a coercion 
Pγ ⊨^V A → P′γ ⊨^V A), in which case we could just ask for an inhabitant 
of (⟨i|Ψ)γ ⊨^V A. However, we find the stated definition technically expedient 
because, by this point, basis vectors and raw indices (instead of usage-checked 
variables) are below our level of abstraction. We prefer to work with linear 
relatedness and ∋-variables. 

By instantiating V in definition 4, we obtain resource-correct versions of 
familiar notions: letting V be ∋ yields resource-correct renamings; and letting V 
be ⊢ (i.e., terms) yields resource-correct substitutions. 

We may informally assign variable names to the entries in the domain context. 


Example 3. Assume ℛ is the natural numbers with ordering given by = and the 
usual addition and multiplication. There is a ∋-environment (a renaming) 


(6a : A, 0b : B, 1c : C, 0d : D) ⇒^∋ (1C, 2A, 4A). 


The mapping of variables to variables and the matrix giving the linear map Ψ are: 


0a : A, 0b : B, 1c : C, 0d : D ∋ c : C      0 0 1 0 
1a : A, 0b : B, 0c : C, 0d : D ∋ a : A      1 0 0 0 
1a : A, 0b : B, 0c : C, 0d : D ∋ a : A      1 0 0 0 


Note that (6 0 1 0) = (1 2 4)Ψ. The first column of Ψ, corresponding to variable 
6a : A, contains two 1s because it has been duplicated (via contraction). The 
second and fourth columns are all 0 because variables b and d have been discarded 
(via weakening). The third column contains one 1 because c is used once. This 
1 appears above the 1s to its left because c has been permuted (via exchange) 
past a. Each of the rows in the matrix is a basis vector because variables can 
only be formed in contexts with basis-compatible annotations. 
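
The arithmetic of this example can be replayed in a few lines of Python. The sketch builds the matrix from the renaming, with one basis-vector row per target variable, and multiplies the target annotations through it. The names SOURCE, RENAMING, PSI and vec_mat are this sketch's own.

```python
SOURCE = ["a", "b", "c", "d"]   # variables of the source context, in order
RENAMING = ["c", "a", "a"]      # source variable chosen for each target position

# Row k of Psi is the basis vector of the source variable that target k maps to.
PSI = [[1 if s == v else 0 for s in SOURCE] for v in RENAMING]

def vec_mat(q, psi):
    """Row vector times matrix: (q Psi)[j] = sum over i of q[i] * psi[i][j]."""
    cols = len(psi[0])
    return [sum(q[i] * psi[i][j] for i in range(len(q))) for j in range(cols)]

# Target annotations (1C, 2A, 4A) pushed through Psi give the source demands.
source_usage = vec_mat([1, 2, 4], PSI)
```

Here source_usage comes out as [6, 0, 1, 0], the annotations on a, b, c and d in the source context.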


Relocation An environment $\rho : P\gamma \Rightarrow^{V} Q\delta$ does not
determine $P$ and $Q$: we can replace them with any $P'$ and $Q'$ that are
related by the linear map $\rho.\Psi$ (that is, the linear map of environment
$\rho$):

Lemma 1 (relocate). Given an environment
$\rho : P\gamma \Rightarrow^{V} Q\delta$ and a $P'$ and a $Q'$ such that
$P' \leq Q'(\rho.\Psi)$, there is also an environment of type
$P'\gamma \Rightarrow^{V} Q'\delta$ with the same linear map and action on
variables.


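Relocation will be used when pushing environments into subterms in section 6.3.

Inductive Construction When V supports subusaging, we can construct a
V-environment by cases on the shape of the target context by the following
rules, which use the bunched connectives from section 3:

$$\frac{I^{*}}{[] : (-) \Rightarrow^{V} \cdot}
\qquad
\frac{\rho : (-) \Rightarrow^{V} \Delta_l \;\mathbin{*}\; \sigma : (-) \Rightarrow^{V} \Delta_r}
     {(\rho, \sigma) : (-) \Rightarrow^{V} \Delta_l, \Delta_r}
\qquad
\frac{r \cdot (M : V\,(-)\,A)}{(M) : (-) \Rightarrow^{V} rA}$$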


Left to right, we can create an environment into the empty context when all 
usage annotations on the source context are 0; we can create an environment 
into a concatenated context when we can additively split up the annotations of 
the source context and produce environments into both halves from the split
sources; and we can create an environment into a singleton context rA when we 
can divide the source context by r and produce a V-value in the divided context 
of the appropriate type. 


Example 4. Assume $\mathcal{R}$ is the natural numbers with ordering given by
$=$ and the usual addition and multiplication, and $\vdash$ is the type of
terms for a System with function application. There is an environment
(substitution)

$$((z), (y\,z)) : (0x : A,\ 2y : B \multimap C,\ 3z : B) \Rightarrow^{\vdash} (1B,\ 2C).$$

We rely on the observations that $(0\ 2\ 3) = (0\ 0\ 1) + (0\ 2\ 2)$ and, on
the right, that $(0\ 2\ 2) = 2\,(0\ 1\ 1)$. Then, we have
$0x : A,\ 0y : B \multimap C,\ 1z : B \vdash z : B$ and
$0x : A,\ 1y : B \multimap C,\ 1z : B \vdash y\,z : C$.
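
The two observations used in Example 4 are plain vector arithmetic. As a sanity check only (Python rather than Agda, with illustrative helper names `add` and `scale` that do not come from the paper), they can be verified as follows.

```python
def add(p, q):
    """Entrywise addition of two usage vectors."""
    return [a + b for a, b in zip(p, q)]

def scale(r, p):
    """Scale a usage vector by the annotation r."""
    return [r * a for a in p]

source = [0, 2, 3]   # annotations on x, y, z in the source context
for_z = [0, 0, 1]    # usage vector at which `z : B` is typed
for_yz = [0, 1, 1]   # usage vector at which `y z : C` is typed

left = scale(1, for_z)    # resources for the target entry 1B
right = scale(2, for_yz)  # resources for the target entry 2C
print(right)                        # [0, 2, 2]
print(add(left, right) == source)   # True
```

This mirrors the second and third construction rules: the source vector is split additively, and the part feeding the singleton context `2C` is divided by 2.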


We could have used these rules to inductively define what environments are. 
However, we found that this was difficult to work with. It is often easier to do 
linear algebraic proofs separately from the rest of an environment. For example, 
for identity and composition of environments (below), definition 4 is easier to use 
because we can rely on the identity and composition of linear maps. Concretely, 
an inductive proof of identity would, for example, involve constructing an
environment of type $P\gamma, Q\delta \Rightarrow^{V} P\gamma, Q\delta$ by
constructing environments of types $P\gamma, 0\delta \Rightarrow^{V} P\gamma$
and $0\gamma, Q\delta \Rightarrow^{V} Q\delta$. These are not identity
environments, so we would have to strengthen the induction hypothesis.


Renameability Renamings, i.e. $\ni$-environments, are a particularly important case
of environments. Renamings form a category, with identity and composition 
following from the identity and composition of linear maps. As in the work of 
Fiore et al. [9], presheaves over renamings are an important class of open families. 


In Agda code, we abbreviate $\Rightarrow^{\ni}$ (which would usually be [ _∋_ ]_⇒ᵉ_) as _⇒ʳ_.

In a setting where new variables can be bound in the middle of a derivation, it
is important that the values we carry around while traversing a term can handle
the existence of variables that appear but that they do not use. We call any
such notion of value renameable. The cofree renameable open type on an open
type $T$ is $\Box^{r}\,T$ (unrelated to the $\Box$ combinator mentioned at the
end of section 4.3), with $T$ then being renameable if it forms a
$\Box^{r}$-coalgebra.


Definition 5. For $T$ an open type,
$(\Box^{r}\,T)\ \Gamma := \forall[\ ((-) \Rightarrow^{r} \Gamma) \mathrel{\dot{\to}} T\ ]$.
That is, $\Box^{r}\,T$ holds at $\Gamma$ when $T$ holds not only at $\Gamma$,
but also at any other $\Gamma^{+}$ which renames to $\Gamma$.

Definition 6. We say that $T$ is renameable whenever there is a function
$\mathrm{ren}^{T} : \forall[\ T \mathrel{\dot{\to}} \Box^{r}\,T\ ]$. That is,
whenever $T$ holds at $\Gamma$, it also holds at any $\Gamma^{+}$ which renames
to $\Gamma$.

A renameable notion of value gives rise to a renameable notion of environment,
essentially by renaming each contained value in an appropriate way. On
the other side, environments admit renamings of their codomains in the opposite 
direction to that given by renameability. 


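Lemma 2 (ren^Env). If $V\,(-)\,A$ is renameable for all $A$, then so is
$(-) \Rightarrow^{V} \Delta$ for all $\Delta$.


Lemma 3. From $\Gamma \Rightarrow^{V} \Delta$ and
$\Delta \Rightarrow^{r} \Theta$, we get $\Gamma \Rightarrow^{V} \Theta$.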


Proof sketch. Notice that the lookup component of an environment maps vari- 
ables in the codomain to values in the domain. We can apply the renaming to 
these variables. 
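
The proof sketch of lemma 3 can be illustrated with finite maps. The Python sketch below deliberately ignores usage annotations and linear maps, keeping only the lookup component; the function name `rename_codomain` and the dictionary representation are assumptions made for illustration.

```python
def rename_codomain(env, renaming):
    """Precompose the lookup of an environment with a renaming.

    env:      dict sending each Delta-variable to a value living in Gamma.
    renaming: dict sending each Theta-variable to a Delta-variable.
    Returns a dict sending each Theta-variable to a value living in Gamma.
    """
    return {theta_var: env[delta_var]
            for theta_var, delta_var in renaming.items()}

env = {"u": "x + 1", "v": "y"}     # values are shown as strings for brevity
renaming = {"p": "v", "q": "u"}    # Theta-variables p, q name Delta-variables
print(rename_codomain(env, renaming))  # {'p': 'y', 'q': 'x + 1'}
```

In the full development the linear maps of the two environments are composed alongside the lookups, which is what keeps the result resource-correct.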


6 Semantics 


Given a V-environment $\Gamma \Rightarrow^{V} \Delta$, the function semantics we define in this section
assigns a C-value in context $\Gamma$ to every term in context $\Delta$, where C is an OpenFam
being the carrier of the semantic interpretation of terms (V being the semantic 
interpretation of variables). Before we can define semantics, we need to treat 
recursion through rules’ premises (section 6.1) and extension of environments 
when going under variable binders (section 6.2). 


6.1 A layer of syntax is functorial 


A basic property of the universe of syntaxes we described in section 4 is that 
every syntax supports a functorial action on subterms, realised by the function 
map-s. Its type says that to map a function f over a layer of syntax, there must be
a linear map F relating the domain and codomain usage contexts, and f should be
usable wherever the domain and codomain usage contexts are similarly related
by F.


map-s : (s : System) →
  (∀ {Θ P Q} → F .rel P Q → ∀[ X Θ (ctx P γ) →̇ Y Θ (ctx Q δ) ]) →
  (∀ {P Q} → F .rel P Q → ∀[ ⟦ s ⟧s X (ctx P γ) →̇ ⟦ s ⟧s Y (ctx Q δ) ])


This generality is needed because usage contexts change between a term
and its immediate subterms: they are decomposed according to the bunched
connectives used in the rules. X and Y are ExtOpenFams, with Θ being the
context extension for a subterm (i.e., the variables newly bound in that subterm).
Unlike usage annotations, types in the contexts γ and δ, and the conclusion
types implicit here, are preserved throughout. This is the essence of the usage
annotation based approach: we use traditional techniques for variable binding,
with the usage annotations layered on top.

The heart of map-s is map-p, which recursively works through the structure ps
of premises of the rule applied, acting on each subterm it finds. Here, particularly
in the clauses for `✴ and `·, we see why it is not enough for the function on
subterms to apply at usage contexts $P$ and $Q$; rather, it also needs to apply at
any similarly related $P'$ and $Q'$. In the case of `✴, we have that $P \leq P_M + P_N$,
with $M$ and $N$ being collections of subterms in usage contexts $P_M$ and $P_N$,
respectively. Linearity of $F$ yields $Q_M$ and $Q_N$ such that $Q \leq Q_M + Q_N$ and
we use map-p recursively at $(P_M, Q_M)$ and $(P_N, Q_N)$ on $M$ and $N$. The cases
for `· and `I✴ are similar, each using a different aspect of linearity. In contrast,
the cases for `1̇ and `×̇, which are the only constructors used in fully structural
systems, do not involve any changes in the usage contexts.


map-p : (ps : Premises) →
  (∀ {Θ P Q} → F .rel P Q → ∀[ X Θ (ctx P γ) →̇ Y Θ (ctx Q δ) ]) →
  (∀ {P Q} → F .rel P Q → ⟦ ps ⟧p X (ctx P γ) → ⟦ ps ⟧p Y (ctx Q δ))

map-p ⟨ Γ `⊢ A ⟩ f r M = f r M
map-p `1̇ f r _ = _
map-p (ps `×̇ qs) f r (M , N) = map-p ps f r M , map-p qs f r N
map-p `I✴ f r I✴⟨ sp0 ⟩ = I✴⟨ F .rel-0ₘ (sp0 , r) ⟩
map-p (ps `✴ qs) f r (M ✴⟨ sp+ ⟩ N) =
  let rM ↘, sp+′ ,↙ rN = F .rel-+ₘ (sp+ , r) in
  map-p ps f rM M ✴⟨ sp+′ ⟩ map-p qs f rN N
map-p (ρ `· ps) f r (⟨ sp* ⟩· M) =
  let r′ , sp*′ = F .rel-*ₘ (sp* , r) in
  ⟨ sp*′ ⟩· map-p ps f r′ M
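
The `✴ clause relies on the fact that a linear map carries an additive split of the domain usage vector to an additive split of the codomain usage vector. The Python sketch below illustrates this under two simplifying assumptions that are not in the paper: F is taken to be a function given by a (hypothetical) matrix rather than a linear relation, and the ordering is taken to be equality.

```python
def vec_add(p, q):
    """Entrywise addition of two usage vectors."""
    return [a + b for a, b in zip(p, q)]

def apply_map(p, matrix):
    """Row vector p times matrix, over the natural numbers."""
    width = len(matrix[0])
    return [sum(p[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(width)]

# A hypothetical linear map from 3-vectors to 2-vectors.
F = [
    [1, 0],
    [1, 1],
    [0, 2],
]

p_m, p_n = [1, 0, 2], [0, 3, 1]   # the split P = P_M + P_N
p = vec_add(p_m, p_n)

q_m = apply_map(p_m, F)           # usage vector for the left subterms
q_n = apply_map(p_n, F)           # usage vector for the right subterms
q = apply_map(p, F)

print(q_m, q_n, q)                # [1, 4] [3, 5] [4, 9]
print(q == vec_add(q_m, q_n))     # True: the split is preserved
```

The recursive calls in the `✴ clause are then made at the pairs (p_m, q_m) and (p_n, q_n), which is why the function on subterms must be usable at every related pair of usage vectors and not just at the original one.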


6.2 The Kripke function space 


At this point we introduce a minor generalisation to OpenFam and ExtOpenFam:
I ─OpenFam and I ─ExtOpenFam. We obtain the definition of I ─OpenFam by
replacing the textual occurrence of Ty by the parameter I.

The definition Kripke V C Δ is a kind of function space that describes a C
value parametrised by Δ-many additional Vs (all correctly typed and usage
annotated). It is used to describe how to go under binders in a Higher-Order
Abstract Syntax style: to go under a binder we must provide semantic
interpretations for all the additional variables:


Kripke : (V : OpenFam v) (C : I ─OpenFam c) → I ─ExtOpenFam _
Kripke = Wrap λ V C Δ Γ A → □ʳ ([ V ]_⇒ᵉ Δ ─✴ᶜ [ C ]_⊨ A) Γ


Wrap is a device that turns any type family into an equivalent type family that 
is judgementally injective in its indices, which helps with Agda’s type inference. 
It turns the type family into a parametrised record with a single field get whose 
type is the type in the body of the λ-abstraction. For understanding the meaning
of Kripke, Wrap can be ignored.

If Δ is of the form s₁B₁, …, sₙBₙ, then Kripke V C Δ Γ A is equivalent to
□ʳ (s₁ ·ᶜ [ V ]_⊨ B₁ ─✴ᶜ ⋯ ─✴ᶜ sₙ ·ᶜ [ V ]_⊨ Bₙ ─✴ᶜ [ C ]_⊨ A) Γ by Currying. That
is to say, the Kripke function is expecting a value for each newly bound variable,
at the multiplicity of its annotation, together with the resources supporting
each of those values. We use the “magic wand” function space here to enforce
the invariant that the freshly bound variables have usage annotations that are
added to the existing variables, not shared with them. The use of the □ʳ modality
ensures that we can still use it in the presence of additional variables introduced
by weakening.

Kripke is functorial in the C argument, as witnessed by the mapKC function, 
which is essentially post-composition: 


mapKC : ∀ {A B} → ∀[ [ C ]_⊨ A →̇ [ C′ ]_⊨ B ] →
  ∀ {Δ Γ} → Kripke V C Δ Γ A → Kripke V C′ Δ Γ B
mapKC f b .get ren .app✴ sp ρ = f (b .get ren .app✴ sp ρ)


6.3 Semantic traversal 


We can now state the data required to implement a traversal assigning semantics 
to terms. For open families V and C, interpreting variables and terms respectively, 
we assume that V is renameable, that V is embeddable in C, and that we have 
an algebra for a layer of syntax, where bound variables are handled using the 
Kripke function space: 


record Semantics (d : System) (V : OpenFam v) (C : OpenFam c)
       : Set (suc 0ℓ ⊔ v ⊔ c) where
  field
    ren^V : ∀ {A} → Renameable ([ V ]_⊨ A)
    ⟦var⟧ : ∀[ V →̇ C ]
    ⟦con⟧ : ∀[ ⟦ d ⟧s (Kripke V C) →̇ C ]


We mutually define the action semantics and its lemma body. The purpose of
semantics is to turn a term into a C-value using a V-environment and the fields
of Semantics. Meanwhile, body does a similar job, but also deals with newly
bound variables. In particular, body takes a term in a context extended by Θ,
and produces a Kripke function from V-values for Θ to C-values.


semantics : ∀ {Γ Δ} → [ V ] Γ ⇒ᵉ Δ → ∀ {sz} →
  ∀[ [ d , sz ] Δ ⊢_ →̇ [ C ] Γ ⊨_ ]

body : ∀ {Γ Δ} → [ V ] Γ ⇒ᵉ Δ → ∀ {sz Θ} →
  ∀[ Scope [ d , sz ]_⊢_ Θ Δ →̇ Kripke V C Θ Γ ]


To implement the new recursor semantics, we use the standard recursor, which
in one case gives us a variable v, and in the other gives us a structure of subterms
M, each of which is in an extended context. To deal with a variable v, we look
it up in the environment ρ, then use the ⟦var⟧ field to map the resulting V-value
to a C-value. To deal with a structure of subterms M, we use the functoriality
of the syntactic structure to consider each subterm separately. On a subterm,
we apply body, which amounts to a recursive call to semantics with an extended
environment. Recall that relocate (lemma 1) adjusts the environment ρ to work
in the usage contexts of the subterms.


semantics ρ (`var v) = ⟦var⟧ $ ρ .lookup (ρ .fit-here) v
semantics ρ (`con M) = ⟦con⟧ $
  map-s (ρ .Ψ) d (λ r → body (relocate ρ r)) (ρ .fit-here) M


For body, we are given a subterm M, to which we want to apply semantics. To
do so, we need an extended version of the initial environment ρ. We express this
as the generation of a Kripke function that produces the extended environment
given interpretations of the fresh variables. We take ρ, which is an environment
covering Δ, and σ, which is an environment covering Θ, and glue them together
using the inductive rules for generating environments, after having renamed ρ
via lemma 2 to make it fit the new context Γ⁺ (intended to be Γ ++ᶜ Θ):


extend : ∀ {Γ Δ Θ} →
  [ V ] Γ ⇒ᵉ Δ → Kripke V ([ V ]_⇒ᵉ_) Θ Γ (Δ ++ᶜ Θ)
extend ρ .get ren .app✴ sp σ = ++ᵉ (ren^Env ren^V ρ ren ✴⟨ sp ⟩ σ)


To define body, we use mapKC to post-compose the environment extension
by the λ-function taking an extended environment and acting with it on M.


body ρ M = mapKC (λ σ → semantics σ M) (extend ρ)


7 Example traversals 


In this section, we provide three example uses of semantic traversals: generic 
renaming and substitution, a usage elaborator, and a denotational semantics. 
The reader is also encouraged to see the far greater range of examples in the 
work of Allais et al. [3], which should adapt to our usage-annotated setting. 
Renaming and substitution are essential results, while the latter two examples 
focus on usage annotations. 

A result we will use throughout this section is reification. When we have an
index-preserving mapping from usage-checked variables to V-values, we can
construct environments of the form $\Gamma \Rightarrow^{V} \Gamma$ (identity
environments) for all $\Gamma$. This lets us write the reify function, which
simplifies our obligations in giving a Semantics by coercing Kripke functions
into just C-values in an extended context.


Lemma 4 (reify). If V is an open family such that there is a function
v : ∀[ _∋_ →̇ V ], we get a function of type ∀[ Kripke V C →̇ Scope C ] for any C.


Proof. Let b : Kripke V C Δ Γ A. That is, b is a Kripke function yielding
C-computations. We want to apply b so as to get a C (Γ, Δ) A. Let
$P\gamma = \Gamma$ and $Q\delta = \Delta$. The □ʳ in the type of b allows us to
reverse-rename $\Gamma$ to $\Gamma, 0\delta$. Then we give the ─✴-function an
argument in context $0\gamma, \Delta$, noting that
$(\Gamma, 0\delta) + (0\gamma, \Delta) = (\Gamma, \Delta)$, as we wanted from
the result. The argument needs type $0\gamma, \Delta \Rightarrow^{V} \Delta$.
We produce this via lemma 3 from an environment
$\rho : 0\gamma, \Delta \Rightarrow^{V} 0\gamma, \Delta$ created using v and a
renaming which is the complement to that used on □ʳ.


All of the Vs used in examples in this paper support identity environments. 
However, Allais et al. [3, p. 27] give some important examples that do not sup- 
port identity environments, and thus cannot use reify (lemma 4). The feature 
that causes the lack of support for identity environments is that a semantics can 
make use of the fact that only variables of particular kinds are bound by the 
syntax. In the examples of Allais et al., a bidirectionally typed language only 
binds variables that synthesise their type, as opposed to those whose type is 
checked. The semantics of type-checking and elaboration rely on variables syn- 
thesising their type, so V is chosen to cover only those variables. Instead of using 
reify, we must observe that each syntactic form only binds such synthesising vari- 
ables. Similar phenomena would appear in, say, a call-by-value language where 
variables are values (not computations), or a polarised language where variables 
always have a polarity matching their type. 


7.1 Renaming and substitution 


In an unpublished note, McBride [15] gives a parametrised traversal yielding 
homomorphisms of syntax, the canonical examples of which are simultaneous 
renaming and simultaneous substitution. The parameters are collected in the 
record Kit. We make a minor change to the original presentation, where instead
of our ren^V field, McBride has the field wk allowing only context extensions. As
for the other two fields, vr allows us to map variables to V-values, so as to put
newly bound variables in environments; and tm allows us to extract terms from
V-values, as required when we use the environment to handle a free variable.


record Kit (d : System) (V : OpenFam v) : Set (suc 0ℓ ⊔ v) where
  field
    ren^V : ∀ {A} → Renameable ([ V ]_⊨ A)
    vr : ∀[ _∋_ →̇ V ]
    tm : ∀[ V →̇ [ d , ∞ ]_⊢_ ]


Where McBride gave the traversal explicitly, we go via our generic semantic 
traversal. The first two fields of Semantics derive directly from fields of Kit. 
Meanwhile, to handle term constructors, we first reify to get a collection of 
traversed subterms, and then use ‘con to assemble these subterms into a similarly 
shaped syntactic form as we started with. The vr field is used implicitly in reify, 
as it is used to show that V-identity environments exist. 


kit→sem : Kit d V → Semantics d V [ d , ∞ ]_⊢_
kit→sem K .ren^V = K .ren^V
kit→sem K .⟦var⟧ = K .tm
kit→sem {d = d} K .⟦con⟧ = `con ∘ map-s′ d reify
  where open Kit K using (identityEnv)

The action of a syntactic traversal on logical rules is basically fixed: we
preserve the logical rule and extend the environment with any newly bound variables
according to vr. Meanwhile, the action on variables is relatively unconstrained:
we look up the variable in the environment to get a V-value, then transform that
V-value into a term using tm.

The idea of simultaneous renaming is that variables replace variables, whereas
with simultaneous substitution, terms replace variables. This translates to
environments for renaming containing ∋-values (variables), and environments for
substitution containing ⊢-values (terms).


Ren-Kit : Kit d _∋_
Ren-Kit = record { ren^V = ren^∋ ; vr = id ; tm = `var }


Notice that ren^⊢, witnessing the fact that terms are renameable, is a
corollary of Ren-Kit.


Sub-Kit : Kit d [ d , ∞ ]_⊢_
Sub-Kit = record { ren^V = ren^⊢ ; vr = `var ; tm = id }
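
To make the shape of the Kit-based traversal concrete, here is a minimal Python sketch for untyped de Bruijn-indexed λ-terms. It is an analogy and not a translation of the Agda code: types, usage annotations, and the generic syntax descriptions are all omitted, and the names `Kit`, `traverse`, `REN_KIT`, and `SUB_KIT` are chosen here only to echo the fields discussed above.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Untyped de Bruijn-indexed terms; usage annotations are deliberately omitted.
@dataclass(frozen=True)
class Var:
    idx: int

@dataclass(frozen=True)
class Lam:
    body: Any

@dataclass(frozen=True)
class App:
    fun: Any
    arg: Any

@dataclass(frozen=True)
class Kit:
    ren: Callable[[Callable[[int], int], Any], Any]  # rename a value (role of ren^V)
    vr: Callable[[int], Any]                         # variable index -> value
    tm: Callable[[Any], Any]                         # value -> term

def traverse(kit, env, term):
    """Apply env (a function from variable indices to values) throughout term."""
    if isinstance(term, Var):
        return kit.tm(env(term.idx))
    if isinstance(term, App):
        return App(traverse(kit, env, term.fun), traverse(kit, env, term.arg))
    if isinstance(term, Lam):
        # Going under the binder: index 0 is the fresh variable, and every
        # existing value is renamed to skip over it.
        def extended(i):
            if i == 0:
                return kit.vr(0)
            return kit.ren(lambda j: j + 1, env(i - 1))
        return Lam(traverse(kit, extended, term.body))
    raise TypeError(f"not a term: {term!r}")

# Renaming: values are variable indices.
REN_KIT = Kit(ren=lambda f, i: f(i), vr=lambda i: i, tm=Var)

def rename(f, term):
    return traverse(REN_KIT, f, term)

# Substitution: values are terms; renaming of terms is obtained from REN_KIT.
SUB_KIT = Kit(ren=rename, vr=Var, tm=lambda t: t)

def substitute(env, term):
    return traverse(SUB_KIT, env, term)

term = Lam(App(Var(1), Var(0)))
print(rename(lambda i: i + 1, term))
# Lam(body=App(fun=Var(idx=2), arg=Var(idx=0)))
print(substitute(lambda i: App(Var(0), Var(1)) if i == 0 else Var(i), term))
# Lam(body=App(fun=App(fun=Var(idx=1), arg=Var(idx=2)), arg=Var(idx=0)))
```

As in the text, the substitution kit reuses renaming of terms, which itself comes from the renaming kit.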


7.2 A usage elaborator 


Using the constructs we have seen so far, producing example terms soon becomes 
extremely tedious. We can achieve some abbreviation by using pattern synonyms 
to wrap around ‘con expressions, but we still have to produce essentially bespoke 
proofs whenever we use a usage-sensitive part of the syntax. The size of each 
of these proofs is roughly proportional to the number of free variables, so the 
amount of proof we have to write grows roughly quadratically with the size of 
terms. An additional factor, which we can’t see on paper, is that type checking 
time for these proofs soon becomes prohibitive to interactive development. 

Our aim in this subsection is to automate usage constraint proofs, making 
terms both easier to write and more performant to check. We invoke the automa- 
tion by writing terms in a syntax where usage constraints have been trivialised, 
and then use a semantic traversal over the simplified syntax to try to produce a 
fully elaborated term in the original syntax. We write the automation in a way 
that is generic in the syntax description, thus avoiding repetition and facilitating 
the prototyping of new type systems. 

The type of syntax descriptions depends on the type of usage annotations 
because of variable binding. For example, in the !r-E rule of figure 2, the right 
premise binds a new variable with annotation r, where r is drawn from the 
ambient posemiring. The scaling combinator also makes direct reference to the 
posemiring. To produce a simplified syntax description, where usage constraints 
are trivialised, we set the ambient posemiring to the 1-element 0 posemiring. In 
contrast to syntax descriptions, even though types can contain usage annota- 
tions, the type of types does not depend on the type of usage annotations. This 
means that, in our simplified syntax, terms have types from the original system 
even though variables have trivial usage annotations. We define the 0 posemiring
as follows, being careful to use the 0-field record type ⊤ so that everything
algebraic gets solved by Agda’s η-laws. Indeed, in this very definition, all of the
semiring operations and laws are canonically inferred.


0-poSemiring : PoSemiring 0ℓ 0ℓ 0ℓ
0-poSemiring = record
  { Carrier = ⊤ ; _≈_ = λ _ _ → ⊤ ; _≤_ = λ _ _ → ⊤ }


The elaboration process is monadic. In particular, we use the
List/non-determinism monad to give all of the possible annotation choices on the
free variables of a term. We believe the commitment to multiple solutions is
inherent when the syntax contains `1̇. For example, in the intermediate stages of
elaborating $\vdash \lambda x.\ (*, *) : A \multimap \top \otimes \top$ with a usage
counting posemiring (assuming reasonable rules for $\top$ and $\otimes$), it is
unclear whether to use the variable $x$ in the left $*$ or the right $*$. This
uncertainty should be reflected in the final result.

The non-deterministic choices we make during elaboration are enumerated
by the fields of NonDetInverses. These choices are driven by the typing rules
and a candidate usage vector for the conclusion. For example, +⁻¹ r is needed
when we encounter a `✴ in the syntax and the candidate usage annotation we are
considering is r. Then, +⁻¹ r is a list of pairs of annotations p and q that r can
split into, together with a proof of the splitting. For 0#⁻¹ and 1#⁻¹, inverses to
constants, we are given the candidate r and typically return an empty list if the
constraint cannot be satisfied, or a singleton list containing a proof. *⁻¹ is used
when we encounter scaling, in which case we know both the scaling factor r (from
the syntax description) and the candidate q. These inverse operations combine
monadically (in fact, applicatively) to give inverses to the vector operations of
zero, addition, scaling, and basis.


record NonDetInverses : Set where
  field
    0#⁻¹ : (r : Ann) → List (r ≤ 0#)
    +⁻¹ : (r : Ann) → List (∃ \ ((p , q) : _ × _) → r ≤ p + q)
    1#⁻¹ : (r : Ann) → List (r ≤ 1#)
    *⁻¹ : (r q : Ann) → List (∃ \ p → q ≤ r * p)


We choose the V of our semantics to be (unannotated) variables. For the C,
we consider functions from candidate usage vectors R to the list of elaborated
derivations with usage annotations given by R. The protocol this encodes is
that the user will provide an unannotated term together with a candidate usage
context R, and usage elaboration returns a list of possible ways the term could
be annotated such that the conclusion has usage context R. The module name U
refers to the fact that we are taking the ambient posemiring to be 0 in OpenFam.
The effect on OpenFam is that the usage annotations of any contexts we consider
are uninformative (hence the _ on the left).


C : System → U.OpenFam _
C sys (U.ctx _ γ) A = ∀ R → List ([ sys , ∞ ] ctx R γ ⊢ A)

To traverse the unannotated terms, we produce a Semantics over the
unannotated system uSystem sys. To write it, we make use of idiom brackets ⦇ … ⦈, which
have the effect of replacing top-level spines of applications by (List-)applicative
applications. Field by field, we already know that variables are renameable. To 
interpret a variable, we consider all the possible proofs that such a variable could 
be well annotated, and package them up as a variable term via the applicative 
machinery. Finally, for compound terms, we first reify the unannotated subterms, 
and then combine the subterms via a lemma. 


elab-sem : ∀ sys → U.Semantics (uSystem sys) U._∋_ (C sys)
elab-sem sys .ren^V = U.ren^∋
elab-sem sys .⟦var⟧ (U.lvar i q _) R =
  ⦇ `var ⦇ (lvar i q) (⟨ i ∣⁻¹ R) ⦈ ⦈
elab-sem sys .⟦con⟧ b R =
  let rb = U.map-s′ (uSystem sys) U.reify b in
  ⦇ `con (lemma sys rb) ⦈


The lemma essentially goes through the shape of the premises, combining
the collections of subterms in the natural way. For example, at each _`×̇_, we
take the Cartesian product of the possibilities of each half, and at each _`✴_,
we non-deterministically split the usage annotations coming in, and then take
the Cartesian product. When it comes to newly bound variables, the syntax 
description tells us their annotations, so there is no further non-determinism 
introduced here. 


lemma : ∀ (sys : System) {A Γ} →
  U.⟦ uSystem sys ⟧s (U.Scope (C sys)) (uCtx Γ) A →
  List (⟦ sys ⟧s (Scope [ sys , ∞ ]_⊢_) Γ A)


To actually use elab-sem on terms, we take the associated semantics and pass
it the identity environment (an identity renaming in this case, because V is
a family of variables). We use elab-unique, which further checks statically that
exactly one derivation is returned. The candidate usage vector R will be [] for 
closed terms, and otherwise we have to supply the intended usage annotations. 

We can now use the elaborator to automatically infer the usage annotations 
for the example at the end of section 4.2. This allows us to write: 


cojoin-!ω : ∀ {A} → [ λR , ∞ ] [] ⊢ (! ω# A ⊸ ! ω# (! ω# A))
cojoin-!ω = elab-unique _ (⊸I (!E (var# 0) (!I (!I (var# 1))))) []


We have instantiated the usage elaborator so that: 0#⁻¹ is a singleton on 0 and
ω, and empty on 1; 1#⁻¹ is a singleton on 1 and ω, and empty on 0; +⁻¹ gives
0 ↦ [(0, 0)], 1 ↦ [(0, 1), (1, 0)], and ω ↦ [(ω, ω)]; and *⁻¹ gives (ω, 0) ↦ [0],
(ω, 1) ↦ [], and (ω, ω) ↦ [ω] (omitting (0, _) and (1, _) cases for brevity). Note
that we do not consider splitting ω up as, say, 1 + ω, because this splitting would
introduce more non-determinism but not allow any more terms to be typed. As
such, the only non-determinism comes when we have variables annotated 1 and
need to do an additive split, like when we apply the !E rule below. At this point,
the variable can become either 0-annotated in the left subterm and 1-annotated
on the right, or vice-versa. We will find that, because the left subterm wants
to use that variable, the former choice will be rejected. The function var# is
a convenience for converting statically known natural numbers, representing de
Bruijn levels, into variable terms.
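
The instantiation just described can be tabulated directly. The Python sketch below is illustrative only: it records the solutions listed in the text without the accompanying proofs, leaves out *⁻¹ (whose (0, _) and (1, _) cases the text omits), and uses the hypothetical names `zero_inv`, `one_inv`, `plus_inv`, and `split_vector`. The last function shows how the per-annotation inverse of addition lifts to usage vectors by taking a Cartesian product, in the spirit of the applicative combination mentioned earlier.

```python
from itertools import product

ZERO, ONE, OMEGA = "0", "1", "w"

def zero_inv(r):
    """Ways to satisfy r <= 0: one for 0 and w, none for 1."""
    return [()] if r in (ZERO, OMEGA) else []

def one_inv(r):
    """Ways to satisfy r <= 1: one for 1 and w, none for 0."""
    return [()] if r in (ONE, OMEGA) else []

# Splittings r <= p + q as listed in the text (w is never split as 1 + w).
PLUS_INV = {
    ZERO: [(ZERO, ZERO)],
    ONE: [(ZERO, ONE), (ONE, ZERO)],
    OMEGA: [(OMEGA, OMEGA)],
}

def plus_inv(r):
    return PLUS_INV[r]

def split_vector(rs):
    """All entrywise splittings of a usage vector into a pair of vectors."""
    results = []
    for choice in product(*(plus_inv(r) for r in rs)):
        left = [p for p, _ in choice]
        right = [q for _, q in choice]
        results.append((left, right))
    return results

print(split_vector([ONE, OMEGA]))
# [(['0', 'w'], ['1', 'w']), (['1', 'w'], ['0', 'w'])]
```

Only the 1-annotated entries contribute a genuine choice, so a vector with n entries annotated 1 has 2^n candidate splittings.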


7.3 A denotational semantics 


To justify the name semantics, we give an example traversal that is a denota- 
tional semantics in the usual sense. The semantics we take is a refinement of 
that of Abel and Bernardy [2], which gives a way to extract parametricity theo- 
rems from substructurally typed programs. Example theorems are that all linear 
terms act as permutations on some fixed set of resources, and all monotonically 
typed terms are really monotonic in the way the typing suggests they are. 

To abbreviate this section, we use a simplified syntax compared to λR. We
allow for an arbitrary family of base types BaseTy, and a single type former
(r , A) ⊸ B, equivalent to ! r A ⊸ B from the earlier system.


data Ty : Set where
  base : BaseTy → Ty
  _⊸_ : (rA : Ann × Ty) (B : Ty) → Ty


In the term syntax, λ-abstraction now binds a variable with annotation r,
while application needs to scale its argument by r (both in accordance with the 
function type they are acting on). 


data `AnnArr : Set where
  `lam `app : (rA : Ann × Ty) (B : Ty) → `AnnArr


AnnArr : System
AnnArr = `AnnArr ▷ λ where
  (`lam rA B) → ⟨ [ rA ]ᶜ `⊢ B ⟩ =⇒ rA ⊸ B
  (`app rA@(r , A) B) → ⟨ []ᶜ `⊢ rA ⊸ B ⟩ `✴ r `· ⟨ []ᶜ `⊢ A ⟩ =⇒ B


As a running example, we take the usage annotations to be the 4-element 
variance posemiring (example 2). We establish the property that all terms are 
monotonic in their free variables. This monotonicity can be covariant or con- 
travariant (or neither or both) depending on the annotation of each free variable. 
This provides an additional example to those of Abel and Bernardy. 

We will take semantics of this system into world-indexed relations [2, 5]. A
world-indexed relation (WRel) over a poset of worlds W is a set over which we 
have a W-indexed binary relation satisfying a presheaf-like property with respect 
to the order on W. The Agda code for world-indexed relations and constructions 
on them can be found in Wood and Atkey [22]. 


Example 5. When W is the 1-element set, a world-indexed relation is just a set 
equipped with a binary relation. 




Morphisms (WRelMor) between world-indexed relations R and S consist of 
a mapping between the underlying sets such that, at each fixed world w, the 
mapping preserves relatedness from R to S. 

When the poset of worlds forms a (relational) commutative monoid, such
world-indexed relations support a symmetric monoidal closed structure, with
objects denoted Iᴿ, _⊗ᴿ_, and _⊸ᴿ_. These reuse the bunched connectives I✴,
✴, and ─✴, now over worlds rather than contexts.

The final piece of semantics we need is a bang operator. We allow the semantic 
bang to be an arbitrary annotation-indexed functor on world-indexed relations. 
This functor must respect all of the structure on the indices, making it a graded 
comonad over multiplication, as well as being lax monoidal at any particular 
index r. These laws are listed in the Generic.Linear.Example.WRel module in 
[22]. 


Example 6. With W being the 1-element set and annotations coming from the
variance semiring, we can define the following bang. It is always the identity
on the set component, while the relation component consists of flipping the
relation for contravariance and taking conjunctions to achieve both covariance
and contravariance. When we want neither covariance nor contravariance, we
use the always true predicate on worlds 1̇.


!ᴿ : WayUp → WRel _≤ʷ_ → WRel _≤ʷ_
!ᴿ a R .set = R .set
!ᴿ ↑↑ R .rel = R .rel
!ᴿ ↓↓ R .rel x y = R .rel y x
!ᴿ ?? R .rel x y = 1̇
!ᴿ ~~ R .rel x y = R .rel x y ×̇ R .rel y x
!ᴿ a R .subres _ = id
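
At a single world, the bang of Example 6 is just an operation on binary relations. The following Python sketch models relations as two-argument predicates; the string names for the four variance annotations and the function name `bang` are assumptions made for readability and do not appear in the paper.

```python
import operator

def bang(annotation, rel):
    """Adjust a binary relation according to a variance annotation."""
    if annotation == "covariant":
        return rel
    if annotation == "contravariant":
        return lambda x, y: rel(y, x)              # flip the relation
    if annotation == "neither":
        return lambda x, y: True                   # always-true relation
    if annotation == "both":
        return lambda x, y: rel(x, y) and rel(y, x)
    raise ValueError(f"unknown annotation: {annotation!r}")

le = operator.le
print(bang("covariant", le)(1, 3))       # True, since 1 <= 3
print(bang("contravariant", le)(3, 1))   # True, since 1 <= 3
print(bang("neither", le)(3, 1))         # True, no constraint at all
print(bang("both", le)(1, 2))            # False, needs 1 <= 2 and 2 <= 1
```

For an antisymmetric relation such as `<=` on integers, the "both" case collapses to equality, which matches the reading that such an argument cannot vary at all.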


The semantics of a type is given by ⟦_⟧, which maps into world-indexed
relations. The function type is interpreted using ⊸ᴿ and !ᴿ. Contexts are
interpreted by ⟦_⟧ᶜ, using ⊗ᴿ and Iᴿ. Terms are interpreted as morphisms by the
open family ⟦_⊢_⟧. Variables are interpreted by lookupᴿ (definition omitted).


lookupᴿ : ∀ {Γ A} → Γ ∋ A → ⟦ Γ ⊢ A ⟧


Now we give a Semantics. The choice of V as _∋_ is somewhat arbitrary, given
that a standard denotational semantics would not use intermediate environments
in the same sense as renaming and substitution, but it allows us to reuse the
standard facts that variables support renaming and identity environments. With
this choice of V and C, we interpret environment entries by lookupᴿ. Meanwhile,
for the logical rules, we ignore environments by using reify to just deal with
morphisms in an extended context. As such, λ-abstractions are easy to interpret,
while applications require some massaging to remove the extension by an empty
context, followed by some plumbing to split the interpretation of the context
according to the usage constraints and feed the interpretation of the argument
n′ into the interpretation of the function m′.


Wrel : Semantics AnnArr _∋_ ⟦_⊢_⟧
Wrel .ren^V = ren^∋
Wrel .⟦var⟧ = lookupᴿ
Wrel .⟦con⟧ (`lam (r , A) B , ≡.refl , m) = curryᴿ (reify m)
Wrel .⟦con⟧ {ctx R γ}
  (`app (r , A) B , ≡.refl , _✴⟨_⟩_ {P} {rQ} m sp+ (⟨_⟩·_ {Q} sp* n)) =
  let n′ : WRelMor ⟦ ctx Q γ ⟧ᶜ ⟦ A ⟧
      n′ = reify n ∘ᴿ ⊗ᴿ-unitʳ←
      m′ : WRelMor (⟦ ctx P γ ⟧ᶜ ⊗ᴿ !ᴿ r ⟦ A ⟧) ⟦ B ⟧
      m′ = uncurryᴿ (reify m ∘ᴿ ⊗ᴿ-unitʳ←) in
  m′ ∘ᴿ map-⊗ᴿ idᴿ (!ᴿ-map n′ ∘ᴿ ctx-* sp*) ∘ᴿ ctx-+ sp+


Then, the semantics of terms is given by the function semantics Wrel 1ʳ,
where 1ʳ is the identity renaming.


Example 7. We can make a subtraction function from primitive addition and
negation on integers. Subtraction is covariant in its first argument and
contravariant in its second argument. We give the definition in pseudocode, though
it is also amenable to the usage elaborator of section 7.2, suitably instantiated.


~~p : ↑↑ℤ ⊸ ↑↑ℤ ⊸ ℤ, ~~n : ↓↓ℤ ⊸ ℤ ⊢ minus : ↑↑ℤ ⊸ ↓↓ℤ ⊸ ℤ
minus = λx. λy. p x (n y)


After feeding in Agda’s addition and negation functions as the interpretations 
of the free variables (noting that they are both monotonic in the required way), 
we get the following free theorem. 


thm : x ℤ.≤ x′ → y′ ℤ.≤ y → x ℤ.+ (ℤ.- y) ℤ.≤ x′ ℤ.+ (ℤ.- y′)
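
The statement of the free theorem can be spot-checked by brute force. The Python sketch below is only a finite test over a small range of integers, not a proof, and the name `minus` mirrors the pseudocode of Example 7 with Python's built-in addition and negation standing in for the free variables.

```python
from itertools import product

def minus(x, y):
    """Subtraction built from addition and negation, as in Example 7."""
    return x + (-y)

values = range(-3, 4)

# Look for inputs with x <= x2 and y2 <= y where the conclusion fails.
counterexamples = [
    (x, x2, y, y2)
    for x, x2, y, y2 in product(values, repeat=4)
    if x <= x2 and y2 <= y and not (minus(x, y) <= minus(x2, y2))
]
print(counterexamples)  # []
```

The hypotheses are oriented as in the theorem: the first argument increases, the second decreases, and the result increases.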


8 Conclusions 


We have presented a framework for doing metatheory for a class of substructural 
type systems in Agda. The framework gives us renaming, substitution, and a 
usage elaborator for new syntaxes for free, which we hope can facilitate
prototyping and the mechanisation of more interesting semantic results. Besides the
mechanised framework itself, we believe its methodology (the use of bunched
premise combinators) can guide and simplify the development of (potentially
unmechanised) substructural type systems.

Our account of substructurality is based on the linear algebraic principles 
described by Wood and Atkey [21]. However, these details only really affect the 
definition of environment, in which the use of linear maps is motivated by them 
being the standard notion of morphism between vectors. We could imagine that 
a similar notion of morphism is found for the kind of annotations found in Licata 
et al. [13], allowing a framework to consider finer substructural systems. 
Abstract. Over twenty years ago, Abadi et al. established the Dependency 
Core Calculus (DCC) as a general purpose framework for analyzing 
dependency in typed programming languages. Since then, dependency 
analysis has shown many practical benefits to language design: its 
results can help users and compilers enforce security constraints and 
eliminate dead code, among other applications. In this work, we present a 
Dependent Dependency Calculus (DDC), which extends this general idea 
to the setting of a dependently-typed language. We use this calculus to 
track both run-time and compile-time irrelevance, enabling faster 
type-checking and program execution. 
1 Dependency Analysis 


Consider this judgment from a type system that has been augmented with de- 
pendency analysis. 


$x :^{L} \mathsf{Int},\ y :^{H} \mathsf{Bool},\ z :^{M} \mathsf{Bool} \vdash \mathsf{if}\ z\ \mathsf{then}\ x\ \mathsf{else}\ 3 :^{M} \mathsf{Int}$ 


In this judgment, L, M and H stand for low, medium and high security levels 
respectively. The computed value of the expression is meant to be a 
medium-security result. The inputs x, y and z have been marked with their 
respective security levels. This expression type-checks because it is 
permissible for medium-security results to depend on both low and 
medium-security inputs. Note that the high-security boolean variable y is not 
used in the expression. However, if we replace z with y in the conditional, 
then the type checker would reject that expression. Even though the 
high-security input would not be returned directly, the medium-security 
result would still depend on it. 

Dependency analysis, as we see above, is an expressive addition to program- 
ming languages. Such analyses allow languages to protect sensitive informa- 
tion [30,16], support run-time code generation [33], slice programs while pre- 
serving behavior [34], etc. Several existing dependency analyses were unified by 
Abadi et al. [1] in their Dependency Core Calculus (DCC). This calculus has 
served as a foundation for static analysis of dependencies in programming lan- 
guages. 
© The Author(s) 2022 
What makes DCC powerful is the parameterization of the type system by a 
generic lattice of dependency levels. Dependency analysis, in essence, is about 
ensuring secure information flow: information never flows from more secure 
to less secure levels. Denning [13] showed that a lattice model, where increasing 
order corresponds to higher security, can be used to enforce secure flow of 
information. DCC integrates this lattice model with the computational λ-calculus 
[22] by grading the monad operator of the latter with elements of the former. 
This integration enables DCC to analyze dependencies in its type system. 
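
To make the lattice model concrete, here is a minimal Python sketch of a three-point chain L < M < H together with the flow check it induces. The names `Level`, `join` and `may_depend` are illustrative assumptions, not definitions from DCC; a general lattice need not be a chain, in which case `join` would not be `max`.

```python
from enum import IntEnum


class Level(IntEnum):
    """A three-point chain of security levels: L < M < H (assumed for illustration)."""
    L = 0
    M = 1
    H = 2


def join(a: Level, b: Level) -> Level:
    """Least upper bound; on a chain this is simply the maximum."""
    return max(a, b)


def may_depend(result: Level, inputs: list) -> bool:
    """A result may depend on inputs only if each input level flows up to it."""
    return all(level <= result for level in inputs)


# The judgment from the start of this section: an M result built from z (M) and x (L).
print(may_depend(Level.M, [Level.M, Level.L]))   # accepted
# Replacing z with the high-security y makes the M result depend on H data.
print(may_depend(Level.M, [Level.H, Level.L]))   # rejected
```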

However, even though many typed languages have included dependency anal- 
ysis in some form, this feature has seen relatively little attention in the context 
of dependently-typed languages. This is unfortunate because, as we show in this 
paper, dependency analysis can provide an elegant foundation for compile-time 
and run-time irrelevance, two important concerns in the design of dependently- 
typed languages. Compile-time irrelevance identifies sub-expressions that are not 
needed for type checking while run-time irrelevance identifies sub-expressions 
that do not affect the result of evaluation. By ignoring or erasing such sub- 
expressions, compilers for dependently-typed languages increase the expressive- 
ness of the type system, improve on compilation time and produce more efficient 
executables. 

Therefore, in this work, we augment a dependently-typed language with a 
primitive notion of dependency analysis and use it to track compile-time and 
run-time irrelevance. We call this language DDC, for Dependent Dependency 
Calculus, in homage to DCC. Although our dependency analyses are structured 
differently, we show that DDC can faithfully embed the terminating fragment 
of DCC and support its many well-known applications, in addition to our novel 
application of tracking compile-time and run-time irrelevance. 

More specifically, our work makes the following contributions: 


— We design a language SDC, for Simple Dependency Calculus, that can ana- 
lyze dependencies in a simply-typed language. We show that SDC is no less 
expressive than the terminating fragment of DCC. The structure of depen- 
dency analysis in SDC enables a relatively straightforward syntactic proof 
of non-interference. (Section 3) 

— We extend SDC to a dependent calculus, DDC$^{\top}$. Using this calculus, we 
analyze run-time irrelevance and show the analysis is correct using a 
non-interference theorem. DDC$^{\top}$ contains SDC as a sub-language. As such, it 
can be used to track other forms of dependencies as well. (Section 4) 

— We generalize DDC$^{\top}$ to DDC. Using this calculus, we analyze both 
run-time and compile-time irrelevance and show that the analyses are correct. 
To the best of our knowledge, DDC is the only system that can distinguish 
run-time and compile-time irrelevance as separate modalities, necessary for 
the proper treatment of projection from irrelevant Σ-types. (Section 5) 

— We have used the Coq proof assistant to mechanically verify the most 
important and delicate part of our designs, the non-interference and type 
soundness theorems for DDC. This mechanization is available online* and as a 
self-contained artifact [11]. 

* https://github.com/sweirich/graded-haskell 


2 Irrelevance and Dependent Types 


Run-time irrelevance (sometimes called erasure) and compile-time irrelevance 
are two forms of dependency analyses that arise in dependent type theories. 
Tracking these dependencies helps compilers produce faster executables and 
makes type checking more flexible [27,19,6,20,3,18,4,24,32,23]. 


2.1 Run-time irrelevance 


Parts of a program that are not required during run time are said to be 
run-time irrelevant. Our goal is to identify such parts. Let’s consider some examples. 
We shall mark variables and arguments with ⊤ if they can be erased prior to 
execution and leave them unmarked if they should be preserved. 

For example, the polymorphic identity function can be marked as: 


id : Π x:^⊤ Type. x -> x 
id = λ^⊤ x. λ y. y 


The first parameter, x, of the identity function is only needed during type 
checking; it can be erased before execution. The second parameter, y, though, is 
required during runtime. When we apply this function to arguments, as in 
(id Bool^⊤ True), we can erase the first argument, Bool, but the second one, True, 
must be retained. 

Indexed data structures provide another example of run-time irrelevance. 

Consider the Vec datatype for length-indexed vectors, as it might look in a 
core language inspired by GHC [31,35]. The Vec datatype has two parameters, n 
and a, that also appear in the types of the data constructors Nil and Cons. These 
parameters are relevant to Vec, but irrelevant to the data constructors. (In the 
types of the constructors, the equality constraints (n ~ Zero) and (n ~ Succ m) 
force n to be equal to the length of the vector.) 


Vec : Nat -> Type -> Type 
Nil : Π n:^⊤ Nat. Π a:^⊤ Type. (n ~ Zero) => Vec n a 
Cons : Π n:^⊤ Nat. Π a:^⊤ Type. Π m:^⊤ Nat. (n ~ Succ m) => a -> Vec m a 
         -> Vec n a 


Now consider a function vmap that maps a given function over a given vector. 
The length of the vector and the type arguments are not necessary for running 
vmap; they are all erasable. So we mark them with ⊤. 


vmap : Π n:^⊤ Nat. Π a b:^⊤ Type. (a -> b) -> Vec n a -> Vec n b 
vmap = λ^⊤ n a b. λ f xs. 
  case xs of 
    Nil -> Nil 
    Cons m^⊤ x xs -> Cons m^⊤ (f x) (vmap m^⊤ a^⊤ b^⊤ f xs) 
Note that the ⊤-marked variables m, a and b appear in the definition of vmap, 
but only in ⊤ contexts. By requiring that ‘unmarked’ terms don’t depend on 
terms marked with ⊤, we can track run-time irrelevance and guarantee safe 
erasure. Observe that even though these arguments are marked with ⊤ to describe 
their use in the definition of vmap, this marking does not reflect their usage in 
the type of vmap. In particular, we are free to use these variables with Vec in a 
relevant manner. 
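
As an illustration of what safe erasure means operationally, the following Python sketch erases ⊤-marked binders and arguments from a tiny untyped term representation. The classes `Var`, `Lam`, `App` and the function `erase` are hypothetical names introduced for this sketch; it assumes the input has already been checked, so that no relevant position mentions an erased variable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    irrelevant: bool   # True when the binder is marked with the top grade
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object
    irrelevant: bool   # True when the argument is marked with the top grade


def erase(term):
    """Drop irrelevant binders and arguments; keep everything else."""
    if isinstance(term, Var):
        return term
    if isinstance(term, Lam):
        body = erase(term.body)
        return body if term.irrelevant else Lam(term.param, False, body)
    if isinstance(term, App):
        fn = erase(term.fn)
        return fn if term.irrelevant else App(fn, erase(term.arg), False)
    raise TypeError(f"unknown term {term!r}")


# id = (irrelevant) \x. \y. y, applied as (id Bool True) with Bool marked irrelevant.
identity = Lam("x", True, Lam("y", False, Var("y")))
applied = App(App(identity, Var("Bool"), True), Var("True"), False)
print(erase(applied))
```

After erasure only the relevant binder y and the relevant argument True remain, matching the informal description of (id Bool^⊤ True) above.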


2.2 Compile-time Irrelevance 


Some type constructors may have arguments which can be ignored during type 
checking. Such arguments are said to be compile-time irrelevant. For example, 
suppose we have a constant function that ignores its argument and returns a 
type. 


phantom : Nat^⊤ -> Type 
phantom = λ^⊤ x. Bool 


To type check idp below, we must show that phantom 0 equals phantom 1. 
Without compile-time irrelevance, we need to β-reduce both sides to show that 
the input and output types are equal. 


idp : phantom 0^⊤ -> phantom 1^⊤ 
idp = λ x. x 


However, in the presence of compile-time irrelevance, we can use the de- 
pendency information contained in the type of a function to reason about it 
abstractly. Because the function f below ignores its argument, it is sound to 
equate the input and output types. 


ida : Π f:^⊤ (Nat^⊤ -> Type). f 0^⊤ -> f 1^⊤ 
ida = λ^⊤ f. λ x. x 


In the absence of compile-time irrelevance, we cannot type-check ida. So 
compile-time irrelevance makes type checking more flexible. 

Compile-time irrelevance can also make type checking faster when the types 
contain expensive computation that can be safely ignored. For example, consider 
the following program that type checks without compile-time irrelevance. How- 
ever, in that case, the type checker must show that fib 28 reduces to 317811, 
where fib represents the Fibonacci function. 


idn : Π f:^⊤ (Nat^⊤ -> Type). f (fib 28)^⊤ -> f 317811^⊤ 
idn = λ^⊤ f. λ x. x 
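
The cost being avoided can be illustrated with a small Python sketch. It is an analogy rather than a model of the type checker: `fib` is a naive Fibonacci function, and `args_equal` is a hypothetical helper that compares two delayed arguments, forcing them only when the argument position is relevant.

```python
def fib(n: int) -> int:
    """Naive Fibonacci with fib(0) = 0 and fib(1) = 1."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def args_equal(lhs, rhs, irrelevant: bool) -> bool:
    """Compare two argument thunks; irrelevant arguments are never forced."""
    if irrelevant:
        return True
    return lhs() == rhs()


# Relevant position: both sides must be evaluated, which takes about a million calls.
print(args_equal(lambda: fib(28), lambda: 317811, irrelevant=False))
# Irrelevant position: the comparison succeeds without running fib at all.
print(args_equal(lambda: fib(28), lambda: 317811, irrelevant=True))
```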


So far, we have used two annotations on variables and terms: ⊤ for irrelevant 
ones and ‘unmarked’ for relevant ones. We used ⊤ to mark both arguments that 
can be erased at runtime and arguments that can be safely ignored by the type 
checker. However, sometimes we need a finer distinction. 
2.3 Strong Irrelevant Σ-types 


Consider the type Σm:^⊤ Nat. Vec m a, which contains pairs whose first component 
is marked as irrelevant. This type might be useful, say, for the output of a filter 
function for vectors, where the length of the output vector cannot be calculated 
statically. If we never need to use this length at runtime, then it would be good 
to mark it with ⊤ so that it need not be stored.⁵ 

However, marking m with ⊤ means that the first component of the pair of this 
type must also be compile-time irrelevant. This results in a significant limitation 
for strong Σ-types: we cannot project the second component from the pair. Say 
we have ys : Σm:^⊤ Nat. Vec m a. The first projection (π1 ys) is a Nat that can only be 
used in irrelevant positions. However, note that the argument n in Vec n a must 
be compile-time relevant; otherwise the type checker would equate Vec 0 a with 
Vec 1 a, making the length index meaningless. The type of (π2 ys) would then 
be Vec (π1 ys) a, which is ill-formed because an irrelevant term (π1 ys) appears 
in a relevant position. 

Therefore, we don’t want to mark the first component of the output of filter 
with ⊤. However, if we leave it unmarked, we cannot erase it at runtime, 
something that we might want to do. A way out of this quandary comes by considering 
terms that are run-time irrelevant but not compile-time irrelevant. Such terms 
exist between completely irrelevant and completely relevant terms. They should 
not depend upon irrelevant terms and relevant terms should not depend upon 
them. We mark such terms with a new annotation, C, with the constraints that 
‘unmarked’ terms do not depend on C and C terms do not depend on ⊤ terms. 
The three annotations, then, correspond to the three levels of a lattice modelling 
secure information flow, with ⊥ < C < ⊤, using ⊥ in lieu of ‘unmarked’. We 
call the lattice $\mathcal{L}_I$, for irrelevance lattice. Using this lattice, we can type check 
the following filter function. 


filter : Π n:^⊤ Nat. Π a:^⊤ Type. (a -> Bool) -> Vec n a -> Σ m:^C Nat. Vec m a 
filter = λ^⊤ n a. λ f vec. 
  case vec of 
    Nil -> (Zero^C, Nil) 
    Cons n1^⊤ x xs 
      | f x -> ((Succ (π1 ys))^C, Cons (π1 ys)^⊤ x (π2 ys)) 
          where 
            ys = filter n1^⊤ a^⊤ f xs 
      | _ -> filter n1^⊤ a^⊤ f xs 


Eisenberg et al. [14] observe that, in Haskell, it is important to use projection 
functions to access the components of the pair that results from the recursive call 
(as in π1 ys and π2 ys) to ensure that filter is not excessively strict. If filter 
instead used pattern matching to eliminate the pair returned by the recursive 

5 It is, however, safe for m to be used in a relevant position in the body of the Σ-type 
even when it is marked with ⊤. This marking indicates how the first component of 
a pair having this type is used, not how the bound variable m is used in the body of 
the type. 
call, it would have needed to filter the entire vector before returning the first 
successful value. This filter function demonstrates the practical utility of strong 
irrelevant Σ-types because it supports the same run-time behavior as the usual 
list filter function but with a more richly-typed data structure. 


3 A Simple Dependency Analyzing Calculus 


Our ultimate goal is a dependent dependency calculus. However, we first start 
with a simply-typed version so that we can explain our approach to dependency 
analysis and non-interference in a simplified setting. 

We call the calculus of this section SDC, for Simple Dependency Calculus. 
This calculus is parameterized by a lattice of labels or grades, which can also be 
thought of as security levels. An excerpt of this calculus appears in Figure 1; it 
is an extension of the simply-typed λ-calculus with a grade-indexed modal type 
$T^{\ell} A$. The modal type $T^{\ell} A$ can be thought of as putting a security barrier of 
grade $\ell$ around the values of $A$. The calculus itself is also graded, which means 
that in a typing judgment, the derived term and every variable in the context 
carries a label or grade. (The specification of the full system, which includes 
unit, products and sums, appears in the extended version of this paper [12].) 


3.1 Type System 


The typing judgment has the form $\Omega \vdash a :^{\ell} A$, which means that “$\ell$ is allowed 
to observe $a$” or that “$a$ is visible at $\ell$”. Selected typing rules for SDC appear in 
Figure 1. Most rules are straightforward and propagate the level of the sub-terms 
to the expression. 

The rule SDC-VAR requires that the grade of the variable in the context 
must be less than or equal to the grade of the observer. In other words, an 
observer at level $\ell$ is allowed to use a variable from level $k$ if and only if $k \le \ell$. 
If the variable’s level is too high, then this rule does not apply, ensuring that 
information can always flow to more secure levels but never to less secure ones. 
Abstraction rule SDC-ABS uses the current level of the expression for the newly 
introduced variable in the context. This makes sense because the argument to 
the function is checked at the same level in rule SDC-APP. 

The modal type, introduced and eliminated with rule SDC-RETURN and 
rule SDC-BIND respectively, manipulates the levels. The former says that, if 
a term is $(\ell \vee \ell_0)$-secure, then we can put it in an $\ell_0$-secure box and release 
it at level $\ell$. An $\ell_0$-secure boxed term can be unboxed only by someone who 
has security clearance for $\ell_0$, as we see in the latter rule. The join operation in 
rule SDC-BIND ensures that $b$ can depend on $a$ only if $b$ itself is $\ell_0$-secure or 
$\ell_0 \le \ell$. 
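
The five rules described above are small enough to prototype. The following Python sketch is an illustrative checker under stated assumptions, not the paper's Coq development: grades are the integers of a two-point chain (so `<=` is the order and `max` is the join), and the class and function names (`Unit`, `Arrow`, `T`, `Ret`, `Bind`, `infer`, `well_typed`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:            # the type Unit
    pass


@dataclass(frozen=True)
class Arrow:           # the type A -> B
    dom: object
    cod: object


@dataclass(frozen=True)
class T:               # the graded modal type T^grade body
    grade: int
    body: object


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    ty: object
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


@dataclass(frozen=True)
class Ret:             # eta^grade body
    grade: int
    body: object


@dataclass(frozen=True)
class Bind:            # bind^grade var = bound in body
    grade: int
    var: str
    bound: object
    body: object


L, H = 0, 1            # assumed two-point lattice: L below H


def infer(ctx, term, lvl):
    """Return the type of `term` as seen by an observer at grade `lvl`."""
    if isinstance(term, Var):                       # SDC-VAR
        grade, ty = ctx[term.name]
        if not grade <= lvl:
            raise TypeError(f"{term.name} is not visible at grade {lvl}")
        return ty
    if isinstance(term, Lam):                       # SDC-ABS
        body_ty = infer({**ctx, term.param: (lvl, term.ty)}, term.body, lvl)
        return Arrow(term.ty, body_ty)
    if isinstance(term, App):                       # SDC-APP
        fn_ty = infer(ctx, term.fn, lvl)
        if not isinstance(fn_ty, Arrow) or infer(ctx, term.arg, lvl) != fn_ty.dom:
            raise TypeError("ill-typed application")
        return fn_ty.cod
    if isinstance(term, Ret):                       # SDC-RETURN
        return T(term.grade, infer(ctx, term.body, max(lvl, term.grade)))
    if isinstance(term, Bind):                      # SDC-BIND
        bound_ty = infer(ctx, term.bound, lvl)
        if not isinstance(bound_ty, T) or bound_ty.grade != term.grade:
            raise TypeError("bind expects a boxed term of matching grade")
        inner = {**ctx, term.var: (max(lvl, term.grade), bound_ty.body)}
        return infer(inner, term.body, lvl)
    raise TypeError(f"unknown term {term!r}")


def well_typed(ctx, term, lvl):
    """True when `infer` succeeds for `term` at grade `lvl`."""
    try:
        infer(ctx, term, lvl)
        return True
    except TypeError:
        return False


ctx = {"h": (H, Unit())}                            # one high-security variable
boxed = Ret(H, Var("h"))
print(well_typed(ctx, Var("h"), L))                         # h is hidden from L
print(well_typed(ctx, boxed, L))                            # but it can be boxed
print(well_typed(ctx, Bind(H, "x", boxed, Var("x")), L))    # unboxing leaks to L
print(well_typed(ctx, Bind(H, "x", boxed, Var("x")), H))    # fine for an H observer
```

A low observer can pass the boxed value around but cannot return its contents, which is the behaviour the join in rule SDC-BIND is meant to enforce.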


6 We use the terms label, level and grade interchangeably. 
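(Grammar) 
\[
\begin{array}{lrcll}
\text{labels}   & \ell, k & ::= & \bot \mid \top \mid k \wedge \ell \mid k \vee \ell \mid \ldots & \\
\text{types}    & A, B    & ::= & \mathsf{Unit} \mid A \to B \mid T^{\ell} A & \\
\text{terms}    & a, b    & ::= & x \mid \lambda x{:}A.\,a \mid a\ b & \text{variables and functions} \\
                &         & \mid & \eta^{\ell}\,a \mid \mathsf{bind}^{\ell}\ x = a\ \mathsf{in}\ b & \text{graded modality} \\
\text{contexts} & \Omega  & ::= & \varnothing \mid \Omega, x :^{\ell} A &
\end{array}
\]

$\Omega \vdash a :^{\ell} A$ (Typing rules) 
\[
\dfrac{\ell_0 \le \ell \qquad x :^{\ell_0} A \in \Omega}{\Omega \vdash x :^{\ell} A}\ \textsc{SDC-Var}
\qquad
\dfrac{\Omega, x :^{\ell} A \vdash b :^{\ell} B}{\Omega \vdash \lambda x{:}A.\,b :^{\ell} A \to B}\ \textsc{SDC-Abs}
\qquad
\dfrac{\Omega \vdash b :^{\ell} A \to B \qquad \Omega \vdash a :^{\ell} A}{\Omega \vdash b\ a :^{\ell} B}\ \textsc{SDC-App}
\]
\[
\dfrac{\Omega \vdash a :^{\ell \vee \ell_0} A}{\Omega \vdash \eta^{\ell_0}\,a :^{\ell} T^{\ell_0} A}\ \textsc{SDC-Return}
\qquad
\dfrac{\Omega \vdash a :^{\ell} T^{\ell_0} A \qquad \Omega, x :^{\ell \vee \ell_0} A \vdash b :^{\ell} B}{\Omega \vdash \mathsf{bind}^{\ell_0}\ x = a\ \mathsf{in}\ b :^{\ell} B}\ \textsc{SDC-Bind}
\]

$a \leadsto a'$ (Small step) 
\[
(\lambda x{:}A.\,a)\ b \leadsto a\{b/x\}\ \ \textsc{SDCStep-Beta}
\qquad
\mathsf{bind}^{\ell_0}\ x = \eta^{\ell_0}\,a\ \mathsf{in}\ b \leadsto b\{a/x\}\ \ \textsc{SDCStep-BindBeta}
\]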

Fig. 1. Simple Dependency Calculus (Excerpt) 


3.2 Meta-theoretic Properties 


This type system satisfies the following properties related to levels. 

First, we can always weaken our assumptions about the variables in the 
context. If a term is derivable with an assumption held at some grade, then that 
term is also derivable with that assumption held at any lower grade. Below, for 
any two contexts $\Omega_1$, $\Omega_2$, we say that $\Omega_1 \le \Omega_2$ iff they are the same modulo the 
grades and further, for any $x$, if $x :^{\ell_1} A \in \Omega_1$ and $x :^{\ell_2} A \in \Omega_2$, then $\ell_1 \le \ell_2$. 


Lemma 1 (Narrowing). If $\Omega' \vdash a :^{\ell} A$ and $\Omega \le \Omega'$, then $\Omega \vdash a :^{\ell} A$. 


Narrowing says that we can always downgrade any variable in the context. 
Conversely, we cannot upgrade context variables in general, but we can upgrade 
them to the level of the judgment. 


Lemma 2 (Restricted Upgrading). If $\Omega_1, x :^{\ell_0} A, \Omega_2 \vdash b :^{\ell} B$ and $\ell_1 \le \ell$, 
then $\Omega_1, x :^{\ell_0 \vee \ell_1} A, \Omega_2 \vdash b :^{\ell} B$. 


The restricted upgrading lemma is needed to show subsumption. Subsump- 
tion states that, if a term is visible at some grade, then it is also visible at all 
higher grades. 


Lemma 3 (Subsumption). If $\Omega \vdash a :^{\ell} A$ and $\ell \le k$, then $\Omega \vdash a :^{k} A$. 
Subsumption is necessary (along with a standard weakening lemma) to show 
that substitution holds for this language. For substitution, we need to ensure 
that the level of the variable matches up with that of the substituted expression. 


Lemma 4 (Substitution). If $\Omega_1, x :^{\ell_0} A, \Omega_2 \vdash b :^{\ell} B$ and $\Omega_1 \vdash a :^{\ell_0} A$, then 
$\Omega_1, \Omega_2 \vdash b\{a/x\} :^{\ell} B$. 


SDC terms are reduced using a call-by-name strategy. An excerpt of the 
small-step semantics appears in Figure 1. Note how the labels on the introduction 
form and the corresponding elimination form match up in rules SDCSTEP-BETA 
and SDCSTEP-BINDBETA. Further, note that we could have also used a call-by- 
value strategy to reduce SDC terms; we chose a call-by-name strategy because 
our development is motivated by potential applications in Haskell. 

For a call-by-name operational semantics, the above lemmas allow us to 
prove a standard type soundness result based on progress and preservation, which 
we omit here. 

Next, we show that our type system is secure by proving non-interference. 


3.3 A Syntactic Proof of Non-interference 


When users with low-security clearance are oblivious to high-security data, we 
say that the system enjoys non-interference. Non-interference results from 
level-specific views of the world. The values $\eta^{H}\,\mathsf{True}$ and $\eta^{H}\,\mathsf{False}$ appear the same 
to an L-user while an H-user can differentiate between them. To capture this 
notion of a level-specific view, we design an indexed equivalence relation on open 
terms, $\sim_{\ell}$, called indexed indistinguishability, and shown in Figure 2. To define 
this relation, we need the labels of the variables in the context but not their 
types. So, we use grade-only contexts $\Phi$, defined as $\Phi ::= \varnothing \mid \Phi, x{:}\ell$. These 
contexts are like graded contexts $\Omega$ without the type information on variables, 
also denoted by $|\Omega|$. 

Informally, $\Phi \vdash a \sim_{\ell} b$ means that $a$ and $b$ appear the same to an $\ell$-user. For 
example, $\eta^{H}\,\mathsf{True} \sim_{L} \eta^{H}\,\mathsf{False}$ but $\neg(\eta^{H}\,\mathsf{True} \sim_{H} \eta^{H}\,\mathsf{False})$. We define this 
relation $\sim_{\ell}$ by structural induction on terms. We think of terms as ASTs 
annotated at various nodes with labels, say $\ell_0$, that determine whether an observer 
$\ell$ is allowed to look at the corresponding sub-tree. If $\ell_0 \le \ell$, then observer $\ell$ can 
start exploring the sub-tree; otherwise the entire sub-tree appears as a blob. So 
we can also read $\Phi \vdash a \sim_{\ell} b$ as: “$a$ is syntactically equal to $b$ at all parts of the 
terms marked with any label $\ell_0$, where $\ell_0 \le \ell$, but may be arbitrarily different 
elsewhere.” 

Note the rule SGEQ-RETURN in Figure 2. It uses an auxiliary relation, 
$\Phi \vdash^{\ell_0}_{\ell} a_1 \sim a_2$. This auxiliary extended equivalence relation formalizes 

7 Because this relation is untyped, its analogue for DDC is similar. For each lemma 
below, we include a reference to the location in the Coq development where it may 
be found for the dependent system. 
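$\Phi \vdash a \sim_{\ell} b$ (Indexed Indistinguishability) 
\[
\dfrac{x{:}\ell_0 \in \Phi \qquad \ell_0 \le \ell}{\Phi \vdash x \sim_{\ell} x}\ \textsc{SGEq-Var}
\qquad
\dfrac{\Phi, x{:}\ell \vdash b_1 \sim_{\ell} b_2}{\Phi \vdash \lambda x{:}A.\,b_1 \sim_{\ell} \lambda x{:}A.\,b_2}\ \textsc{SGEq-Abs}
\qquad
\dfrac{\Phi \vdash b_1 \sim_{\ell} b_2 \qquad \Phi \vdash a_1 \sim_{\ell} a_2}{\Phi \vdash b_1\ a_1 \sim_{\ell} b_2\ a_2}\ \textsc{SGEq-App}
\]
\[
\dfrac{\Phi \vdash^{\ell_0}_{\ell} a_1 \sim a_2}{\Phi \vdash \eta^{\ell_0}\,a_1 \sim_{\ell} \eta^{\ell_0}\,a_2}\ \textsc{SGEq-Return}
\qquad
\dfrac{\Phi \vdash a_1 \sim_{\ell} a_2 \qquad \Phi, x{:}\ell \vee \ell_0 \vdash b_1 \sim_{\ell} b_2}{\Phi \vdash \mathsf{bind}^{\ell_0}\ x = a_1\ \mathsf{in}\ b_1 \sim_{\ell} \mathsf{bind}^{\ell_0}\ x = a_2\ \mathsf{in}\ b_2}\ \textsc{SGEq-Bind}
\]

$\Phi \vdash^{\ell_0}_{\ell} a_1 \sim a_2$ (Extended Equivalence) 
\[
\dfrac{\ell_0 \le \ell \qquad \Phi \vdash a_1 \sim_{\ell} a_2}{\Phi \vdash^{\ell_0}_{\ell} a_1 \sim a_2}\ \textsc{SEq-Leq}
\qquad
\dfrac{\neg(\ell_0 \le \ell)}{\Phi \vdash^{\ell_0}_{\ell} a_1 \sim a_2}\ \textsc{SEq-NLeq}
\]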

Fig. 2. Indexed indistinguishability for SDC (Excerpt) 


the idea discussed above: if $\ell_0 \le \ell$, then $a_1$ and $a_2$ must be indistinguishable at 
$\ell$; otherwise, they may be arbitrary terms. 
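
The relation is directly executable as a recursive comparison. The following Python sketch follows the rules of Figure 2 under a few simplifying assumptions that are not in the paper: grades are integers on a two-point chain, bound variable names must match exactly (no α-equivalence), type annotations on abstractions are dropped, and a `Const` node for booleans is added so that the example from the text can be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:           # boolean constants (an addition for this sketch)
    value: bool


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


@dataclass(frozen=True)
class Ret:             # eta^grade body
    grade: int
    body: object


@dataclass(frozen=True)
class Bind:            # bind^grade var = bound in body
    grade: int
    var: str
    bound: object
    body: object


L, H = 0, 1            # assumed two-point lattice: L below H


def indist(phi, a, b, lvl):
    """Decide whether a and b look the same to an observer at grade lvl."""
    if type(a) is not type(b):
        return False
    if isinstance(a, Const):
        return a.value == b.value
    if isinstance(a, Var):                                    # SGEQ-VAR
        return a.name == b.name and phi[a.name] <= lvl
    if isinstance(a, Lam):                                    # SGEQ-ABS
        return a.param == b.param and indist({**phi, a.param: lvl}, a.body, b.body, lvl)
    if isinstance(a, App):                                    # SGEQ-APP
        return indist(phi, a.fn, b.fn, lvl) and indist(phi, a.arg, b.arg, lvl)
    if isinstance(a, Ret):                                    # SGEQ-RETURN
        return a.grade == b.grade and extended(phi, a.grade, a.body, b.body, lvl)
    if isinstance(a, Bind):                                   # SGEQ-BIND
        return (a.grade == b.grade and a.var == b.var
                and indist(phi, a.bound, b.bound, lvl)
                and indist({**phi, a.var: max(lvl, a.grade)}, a.body, b.body, lvl))
    return False


def extended(phi, grade, a, b, lvl):
    """SEQ-LEQ compares the bodies; SEQ-NLEQ accepts any two terms."""
    if grade <= lvl:
        return indist(phi, a, b, lvl)
    return True


secret_true, secret_false = Ret(H, Const(True)), Ret(H, Const(False))
print(indist({}, secret_true, secret_false, L))   # the L observer sees a blob
print(indist({}, secret_true, secret_false, H))   # the H observer sees the difference
```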
Now, we explore some properties of the indistinguishability relation.⁷ 

If we remove the second component from an indistinguishability relation, 
$\Phi \vdash a \sim_{\ell} b$, we get a new judgment, $\Phi \vdash a : \ell$, called the grading judgment. Now, 
corresponding to every indistinguishability rule, we define a grading rule where the 
indistinguishability judgments have been replaced with their grading 
counterparts. Terms derived using these grading rules are called well-graded. We can 
show that well-typed terms are well-graded. 


Lemma 5 (Typing implies grading). If $\Omega \vdash a :^{\ell} A$ then $|\Omega| \vdash a : \ell$. 


Lemma 6 (Equivalence). Indexed indistinguishability at $\ell$ is an equivalence 
relation on well-graded terms at $\ell$. 


The above lemma shows that indistinguishability is an equivalence relation. 
Observe that at the highest element of the lattice, $\top$, this equivalence degenerates 
to the identity relation. 

Indistinguishability is closed under extended equivalence. The following is like a 
substitution lemma for the relation. 


Lemma 7 (Indistinguishability under substitution). If $\Phi, x{:}\ell \vdash b_1 \sim_{k} b_2$ 
and $\Phi \vdash^{\ell}_{k} a_1 \sim a_2$ then $\Phi \vdash b_1\{a_1/x\} \sim_{k} b_2\{a_2/x\}$. 


With regard to the above lemma, consider the situation when $\neg(\ell \le k)$, for 
example, when $\ell = H$ and $k = L$. In such a situation, for any two terms $a_1$ 
and $a_2$, if $\Phi, x{:}\ell \vdash b_1 \sim_{k} b_2$, then $\Phi \vdash b_1\{a_1/x\} \sim_{k} b_2\{a_2/x\}$. Let us work out 
a concrete example. For a typing derivation $x :^{H} A \vdash b :^{L} \mathsf{Bool}$, we have, by 
lemmas 5 and 6, $x{:}H \vdash b \sim_{L} b$. Then, $\varnothing \vdash b\{a_1/x\} \sim_{L} b\{a_2/x\}$. This is almost 
non-interference in action. What’s left to show is that the indistinguishability 
relation respects the small step semantics, written $a \leadsto a'$. The small-step 
relation is standard call-by-name reduction. 


Theorem 1 (Non-interference). If $\Phi \vdash a_1 \sim_{k} a_1'$ and $a_1 \leadsto a_2$ then there 
exists some $a_2'$ such that $a_1' \leadsto a_2'$ and $\Phi \vdash a_2 \sim_{k} a_2'$. 


Since the step relation is deterministic, in the above theorem, there is exactly 
one such $a_2'$ that $a_1'$ steps to. Now, going back to our last example, we see that 
$b\{a_1/x\}$ and $b\{a_2/x\}$ take steps in tandem and they are L-indistinguishable 
after each and every step. Since the language itself is terminating, both the terms 
reduce to boolean values, values that are themselves L-indistinguishable as well. 
But the indistinguishability for boolean values is just the identity relation. This 
means that $b\{a_1/x\}$ and $b\{a_2/x\}$ reduce to the same value. 

The indistinguishability relation gives us a syntactic method of proving non- 
interference for programs derived in SDC. Essentially, we show that a user with 
low-security clearance cannot distinguish between high security values just by 
observing program behavior. 
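
The worked example can be replayed with a small call-by-name evaluator. The Python sketch below is illustrative only: the term classes, `subst`, `step` and `evaluate` are hypothetical names, substitution is only correct for closed replacement terms (which suffices for closed programs under call-by-name), and boolean constants are added so that the final values can be compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: bool


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


@dataclass(frozen=True)
class Ret:             # eta^grade body
    grade: int
    body: object


@dataclass(frozen=True)
class Bind:            # bind^grade var = bound in body
    grade: int
    var: str
    bound: object
    body: object


H = 1                  # the high grade of an assumed two-point lattice


def subst(t, x, s):
    """Replace free occurrences of x in t by the closed term s."""
    if isinstance(t, Var):
        return s if t.name == x else t
    if isinstance(t, Const):
        return t
    if isinstance(t, Lam):
        return t if t.param == x else Lam(t.param, subst(t.body, x, s))
    if isinstance(t, App):
        return App(subst(t.fn, x, s), subst(t.arg, x, s))
    if isinstance(t, Ret):
        return Ret(t.grade, subst(t.body, x, s))
    if isinstance(t, Bind):
        body = t.body if t.var == x else subst(t.body, x, s)
        return Bind(t.grade, t.var, subst(t.bound, x, s), body)
    raise TypeError(f"unknown term {t!r}")


def step(t):
    """One call-by-name step, or None if t is a value or stuck."""
    if isinstance(t, App):
        if isinstance(t.fn, Lam):                                  # beta
            return subst(t.fn.body, t.fn.param, t.arg)
        fn = step(t.fn)
        return None if fn is None else App(fn, t.arg)
    if isinstance(t, Bind):
        if isinstance(t.bound, Ret) and t.bound.grade == t.grade:  # bind-beta
            return subst(t.body, t.var, t.bound.body)
        bound = step(t.bound)
        return None if bound is None else Bind(t.grade, t.var, bound, t.body)
    return None


def evaluate(t):
    """Iterate `step` until no further step applies."""
    nxt = step(t)
    while nxt is not None:
        t, nxt = nxt, step(nxt)
    return t


# b mentions the high-security variable x only inside an H-graded box.
b = App(Lam("z", Const(True)), Ret(H, Var("x")))
print(evaluate(subst(b, "x", Const(True))) == evaluate(subst(b, "x", Const(False))))
```

Both substitution instances reduce to the same boolean, as the argument above predicts for any term that is well-typed at L with x held at H.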

Next, we show that SDC is no less expressive than the terminating fragment of 
DCC. 


3.4 Relation with Sealing Calculus and Dependency Core Calculus 


SDC is extremely similar to the sealing calculus $\lambda^{[\,]}$ of Shikuma and Igarashi [29]. 
Like SDC, $\lambda^{[\,]}$ has a label on the typing judgment.⁸ But unlike SDC, $\lambda^{[\,]}$ uses 
standard ungraded typing contexts $\Gamma$. Both the calculi have the same types. As 
far as terms are concerned, there is only one difference. The sealing calculus has 
an unseal term whereas SDC uses bind. We present the rules for sealing and 
unsealing terms in $\lambda^{[\,]}$ below.⁹ 


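\[
\dfrac{\Gamma \vdash a :^{\ell \vee \ell_0} A}{\Gamma \vdash \eta^{\ell_0}\,a :^{\ell} T^{\ell_0} A}\ \textsc{Sealing-Seal}
\qquad
\dfrac{\Gamma \vdash a :^{\ell} T^{\ell_0} A \qquad \ell_0 \le \ell}{\Gamma \vdash \mathsf{unseal}^{\ell_0}\,a :^{\ell} A}\ \textsc{Sealing-Unseal}
\]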


Shikuma and Igarashi [29] have shown that $\lambda^{[\,]}$ is equivalent to DCC$_{pc}$, an 
extension of the terminating fragment of DCC. Therefore, we compare SDC to 
DCC by simulating $\lambda^{[\,]}$ in SDC. For this, we define a translation $\overline{\,\cdot\,}$ from $\lambda^{[\,]}$ to 
SDC. Most of the cases are handled inductively in a straightforward manner. 


For unseal, we have $\overline{\mathsf{unseal}^{\ell}\,a} := \mathsf{bind}^{\ell}\ x = \overline{a}\ \mathsf{in}\ x$. 
8 Note that our labels correspond to observer levels of [29], which can be viewed as a 
lattice. 
9 We take the liberty of making small cosmetic changes in the presentation. 
With this translation, we can give a forward and a backward simulation 
connecting the two languages. The reduction relation $\leadsto$ below is full reduction 
for both the languages, the reduction strategy used by Shikuma and Igarashi 
[29] for $\lambda^{[\,]}$. Full reduction is a non-deterministic reduction strategy whereby a 
β-redex in any sub-term may be reduced. 


Theorem 2 (Forward Simulation). If $a \leadsto a'$ in $\lambda^{[\,]}$, then $\overline{a} \leadsto \overline{a'}$ in SDC. 


Theorem 3 (Backward Simulation). For any term $a$ in $\lambda^{[\,]}$, if $\overline{a} \leadsto b$ in 
SDC, then there exists $a'$ in $\lambda^{[\,]}$ such that $b = \overline{a'}$ and $a \leadsto a'$. 


The translation also preserves typing. In fact, a source term and its target 
have the same type. Below, for an ordinary context $\Gamma$, the graded context $\Gamma^{\ell}$ 
denotes $\Gamma$ with the labels for all the variables set to $\ell$. 


Theorem 4 (Translation Preserves Typing). If $\Gamma \vdash a :^{\ell} A$, then $\Gamma^{\ell} \vdash \overline{a} :^{\ell} A$. 


The above translation shows that the terminating fragment of DCC can be 
embedded into SDC. Therefore SDC is at least as expressive as the terminating 
fragment of DCC. Further, SDC lends itself nicely to syntactic proof techniques 
for non-interference. This approach generalizes to more expressive systems, as 
we shall see in the next section, where we extend SDC to a general dependent 
dependency calculus. 


4 A Dependent Dependency Analyzing Calculus 


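\[
\begin{array}{rcll}
a, A, b, B & ::= & s \mid \mathsf{unit} \mid \mathsf{Unit} & \text{sorts and unit} \\
 & \mid & \Pi x{:}^{\ell}A.B \mid x \mid \lambda x{:}^{\ell}A.\,a \mid a\ b^{\ell} & \text{dependent functions} \\
 & \mid & \Sigma x{:}^{\ell}A.B \mid (a^{\ell}, b) \mid \mathsf{let}\ (x^{\ell}, y) = a\ \mathsf{in}\ b & \text{dependent pairs} \\
 & \mid & A + B \mid \mathsf{inj}_1\ a \mid \mathsf{inj}_2\ a \mid \mathsf{case}\ a\ \mathsf{of}\ b_1; b_2 & \text{disjoint unions}
\end{array}
\]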


Fig. 3. Dependent Dependency Calculus Grammar (Types and Terms) 


Here and in the next section, we present dependently-typed languages, with 
dependency analysis in the style of SDC. The first extension, called DDC$^{\top}$, is a 
straightforward integration of labels and dependent types. This system subsumes 
SDC, and so can be used for the same purposes. Here, we show how it can be used 
to analyze run-time irrelevance. Then, in Section 5, we generalize this system to 
DDC, which allows definitional equality to ignore unnecessary sub-terms, thus 
also enabling compile-time irrelevance. We present the system in this way both 
to simplify the presentation and to show that DDC$^{\top}$ is an intermediate point 
in the design space. 

Both DDC$^{\top}$ and DDC are pure type systems [5]. They share the same 
syntax, shown in Figure 3, combining terms and types into the same grammar. 
They are parameterized by a set of sorts $s$, a set of axioms $\mathcal{A}(s_1, s_2)$ which is a 
binary relation on sorts, and a set of rules $\mathcal{R}(s_1, s_2, s_3)$ which is a ternary relation 
on sorts. For simplicity, we assume, without loss of generality, that for every sort 
$s_1$, there is some sort $s_2$, such that $\mathcal{A}(s_1, s_2)$.¹⁰ 

We annotate several syntactic forms with grades for dependency analysis. The 
dependent function type, written $\Pi x{:}^{\ell}A.B$, includes the grade of the argument 
to a function having this type. Similarly, the dependent pair type, written 
$\Sigma x{:}^{\ell}A.B$, includes the grade of the first component of a pair having this type.¹¹ 
We can interpret these types as a fusion of the usual, ungraded dependent types 
and the graded modality $T^{\ell} A$ we saw earlier. In other words, $\Pi x{:}^{\ell}A.B$ acts 
like the type $\Pi y{:}(T^{\ell} A).\ \mathsf{bind}\ x = y\ \mathsf{in}\ B$ and $\Sigma x{:}^{\ell}A.B$ acts like the type 
$\Sigma y{:}(T^{\ell} A).\ \mathsf{bind}\ x = y\ \mathsf{in}\ B$. Because of this fusion, we do not need to add the 
graded modality type as a separate form: we can define $T^{\ell} A$ as $\Sigma x{:}^{\ell}A.\mathsf{Unit}$. 
Using $\Pi x{:}^{\ell}A.B$ instead of $\Pi y{:}(T^{\ell} A).\ \mathsf{bind}\ x = y\ \mathsf{in}\ B$ has an advantage: the 
former allows $x$ to be held at differing grades while type checking $B$ and the 
body of a function having this Π-type while the latter requires $x$ to be held at 
the same grade in both the cases. We utilize this flexibility in Section 5. 


4.1 DDC$^{\top}$: Π-types 


The core typing rules for DDC$^{\top}$ appear in Figure 4. As in the simple type 
system, the variables in the context are labelled and the judgement itself includes 
a label $\ell$. Rule DCT-VAR is similar to its counterpart in the simply-typed 
language: the variable being observed must be graded less than or equal to the level 
of the observer. Rule DCT-PI propagates the level of the expression to the 
sub-terms of the Π-type. Note that this type is annotated with an arbitrary label 
$\ell_0$: the purpose of this label $\ell_0$ is to denote the level at which the argument to 
a function having this type may be used. 

In rule DCT-ABS, the parameter of the function is introduced into the 
context at level $\ell_0 \vee \ell$ (akin to rule SDC-BIND). In rule DCT-APP, the argument 
to the function is checked at level $\ell_0 \vee \ell$ (akin to rule SDC-RETURN). Note that 
the Π-type is checked at $\top$ in rule DCT-ABS. In DDC$^{\top}$, level $\top$ corresponds to 
‘compile time’ observers and motivates the superscript $\top$ in the language name. 

Rule DCT-CONV converts the type of an expression to an equivalent type. 
The judgment $|\Omega| \vdash A \equiv_{\top} B$ is a label-indexed definitional equality relation 


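10 This assumption does not lead to any loss in generality because given a pure type 
system $(\mathcal{S}', \mathcal{A}', \mathcal{R}')$ that does not meet the above condition, we can provide 
another pure type system $(\mathcal{S}'', \mathcal{A}'', \mathcal{R}'')$, where $\mathcal{S}'' = \mathcal{S}' \cup \{s_0\}$ (given $s_0 \notin \mathcal{S}'$) and 
$\mathcal{A}'' = \mathcal{A}' \cup \{(s, s_0) \mid s \in \mathcal{S}''\}$ and $\mathcal{R}'' = \mathcal{R}'$, such that there exists a straightforward 
bisimulation between the two systems. 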

11 We use standard abbreviations when $x$ is not free in $B$: we write ${}^{\ell}A \to B$ for $\Pi x{:}^{\ell}A.B$ 
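$\Omega \vdash a :^{\ell} A$ (Typing) 
\[
\dfrac{\mathcal{A}(s_1, s_2)}{\Omega \vdash s_1 :^{\ell} s_2}\ \textsc{DCT-Type}
\qquad
\dfrac{\ell_0 \le \ell \qquad x :^{\ell_0} A \in \Omega}{\Omega \vdash x :^{\ell} A}\ \textsc{DCT-Var}
\qquad
\dfrac{\Omega \vdash A :^{\ell} s_1 \qquad \Omega, x :^{\ell} A \vdash B :^{\ell} s_2 \qquad \mathcal{R}(s_1, s_2, s_3)}{\Omega \vdash \Pi x{:}^{\ell_0}A.B :^{\ell} s_3}\ \textsc{DCT-Pi}
\]
\[
\dfrac{\Omega, x :^{\ell_0 \vee \ell} A \vdash b :^{\ell} B \qquad \Omega \vdash (\Pi x{:}^{\ell_0}A.B) :^{\top} s}{\Omega \vdash \lambda x{:}^{\ell_0}A.\,b :^{\ell} \Pi x{:}^{\ell_0}A.B}\ \textsc{DCT-Abs}
\qquad
\dfrac{\Omega \vdash b :^{\ell} \Pi x{:}^{\ell_0}A.B \qquad \Omega \vdash a :^{\ell_0 \vee \ell} A}{\Omega \vdash b\ a^{\ell_0} :^{\ell} B\{a/x\}}\ \textsc{DCT-App}
\]
\[
\dfrac{\Omega \vdash a :^{\ell} A \qquad |\Omega| \vdash A \equiv_{\top} B \qquad \Omega \vdash B :^{\top} s}{\Omega \vdash a :^{\ell} B}\ \textsc{DCT-Conv}
\]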


Fig. 4. DDC$^{\top}$ type system (core rules) 


instantiated to $\top$. This relation is the closure of the indexed indistinguishability 
relation (Section 3.3) under small-step call-by-name evaluation. When 
instantiated to $\top$, the relation degenerates to β-equivalence. So the rule DCT-CONV is 
essentially casting a term to a β-equivalent type; however, in the next section, 
we utilize the flexibility of label-indexing to cast a term to a type that may not 
be β-equivalent. Also, note that the equality relation itself is untyped. As such, 
we need the third premise to guarantee that the new type is well-formed. 


4.2 DDC$^{\top}$: Σ-types 


The language DDC$^{\top}$ includes Σ-types, as specified by the rules below. 


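\[
\dfrac{\Omega \vdash A :^{\ell} s_1 \qquad \Omega, x :^{\ell} A \vdash B :^{\ell} s_2 \qquad \mathcal{R}(s_1, s_2, s_3)}{\Omega \vdash \Sigma x{:}^{\ell_0}A.B :^{\ell} s_3}\ \textsc{DCT-WSigma}
\]
\[
\dfrac{\Omega \vdash a :^{\ell_0 \vee \ell} A \qquad \Omega \vdash b :^{\ell} B\{a/x\} \qquad \Omega \vdash \Sigma x{:}^{\ell_0}A.B :^{\top} s}{\Omega \vdash (a^{\ell_0}, b) :^{\ell} \Sigma x{:}^{\ell_0}A.B}\ \textsc{DCT-WPair}
\]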


Like $\Pi$-types, $\Sigma$-types include a grade that is not related to how the bound
variable is used in the body of the type. The grade indicates the level at which
the first component of a pair having the $\Sigma$-type may be used. In rule DCT-WPair,
we check the first component $a$ of the pair at a level raised by $\ell_0$, the
level annotating the type, akin to rule SDC-Return. The second component $b$
is checked at the current level.


and ${}^{\ell}A \times B$ for $\Sigma x{:}^{\ell} A.B$.
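\[
\textsc{DCT-LetPair}\quad
\dfrac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \qquad \Omega, x{:}^{\ell_0 \vee \ell} A, y{:}^{\ell} B \vdash c :^{\ell} C\{(x^{\ell_0}, y)/z\} \qquad \Omega, z{:}^{\top} (\Sigma x{:}^{\ell_0} A.B) \vdash C :^{\top} s}{\Omega \vdash \mathbf{let}\ (x^{\ell_0}, y) = a\ \mathbf{in}\ c :^{\ell} C\{a/z\}}
\]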


The rule DCT-LetPair eliminates pairs using dependently-typed pattern
matching. The pattern variables $x$ and $y$ are introduced into the context while
checking the body $c$. Akin to rule SDC-Bind, the level of the first pattern
variable, $x$, is raised by $\ell_0$. The result type $C$ is refined by the pattern match,
informing the type system that the pattern $(x^{\ell_0}, y)$ is equal to the scrutinee $a$.

Because of this refinement in the result type, we can define the projection
operations through pattern matching. In particular, the first projection is
$\pi_1^{\ell_0}\, a := \mathbf{let}\ (x^{\ell_0}, y) = a\ \mathbf{in}\ x$, while the second projection is
$\pi_2^{\ell_0}\, a := \mathbf{let}\ (x^{\ell_0}, y) = a\ \mathbf{in}\ y$.
These projections can be type checked according to the following derived rules:


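\[
\textsc{DCT-Proj1}\quad
\dfrac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \qquad \ell_0 \le \ell}{\Omega \vdash \pi_1^{\ell_0}\, a :^{\ell} A}
\]
\[
\textsc{DCT-Proj2}\quad
\dfrac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B}{\Omega \vdash \pi_2^{\ell_0}\, a :^{\ell} B\{\pi_1^{\ell_0}\, a/x\}}
\]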

Note that the derived rule DCT-Proj1 limits access to the first component
through the premise $\ell_0 \le \ell$, akin to rule Sealing-Unseal. This condition makes
sense because it aligns the observability of the first component of the pair with
the label on the $\Sigma$-type.


4.3 Embedding SDC into DDC$^{\top}$

Here, we show how to embed SDC into DDC$^{\top}$.

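We define a translation function, $\overline{\,\cdot\,}$, that takes the types and terms in SDC
to terms in DDC$^{\top}$. For types, the translation is defined as:
$\overline{A \to B} := {}^{\bot}\overline{A} \to \overline{B}$,
$\overline{A \times B} := {}^{\bot}\overline{A} \times \overline{B}$ and
$\overline{T^{\ell} A} := \Sigma x{:}^{\ell} \overline{A}.\mathbf{Unit}$.
For terms, the translation is straightforward except for the following cases:
$\overline{\eta^{\ell}\, a} := (\overline{a}^{\ell}, \mathbf{unit})$ and
$\overline{\mathbf{bind}^{\ell}\ x = a\ \mathbf{in}\ b} := \mathbf{let}\ (x^{\ell}, y) = \overline{a}\ \mathbf{in}\ \overline{b}$,
where $y$ is a fresh variable. By lifting the translation to
contexts, we show that translation preserves typing.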
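The type-level part of this translation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the constructor names (`Base`, `Arrow`, `Prod`, `Modal`, `Pi`, `Sigma`, `UnitTy`) are hypothetical, levels are opaque strings, base types are assumed to translate to themselves, and binder names are dropped because the translated types are non-dependent.

```python
from dataclasses import dataclass

BOT = "bot"  # least lattice element (levels are opaque labels in this sketch)

# --- SDC types (hypothetical constructor names) ---
@dataclass(frozen=True)
class Base:
    name: str

@dataclass(frozen=True)
class Arrow:
    dom: object
    cod: object

@dataclass(frozen=True)
class Prod:
    left: object
    right: object

@dataclass(frozen=True)
class Modal:  # the graded modal type T^l A
    level: str
    body: object

# --- target types; binders omitted since nothing depends on them here ---
@dataclass(frozen=True)
class Pi:
    level: str
    dom: object
    cod: object

@dataclass(frozen=True)
class Sigma:
    level: str
    fst: object
    snd: object

@dataclass(frozen=True)
class UnitTy:
    pass

def translate(ty):
    """Translate an SDC type into a graded Pi/Sigma type."""
    if isinstance(ty, Base):
        return ty  # assumption: base types are shared by both calculi
    if isinstance(ty, Arrow):
        return Pi(BOT, translate(ty.dom), translate(ty.cod))
    if isinstance(ty, Prod):
        return Sigma(BOT, translate(ty.left), translate(ty.right))
    if isinstance(ty, Modal):
        # T^l A becomes a pair whose first component is graded l
        return Sigma(ty.level, translate(ty.body), UnitTy())
    raise TypeError(f"not an SDC type: {ty!r}")
```

For instance, `translate(Modal("High", Base("Bool")))` yields `Sigma("High", Base("Bool"), UnitTy())`, mirroring how the modal type becomes a $\Sigma$-type with a trivial second component.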

Theorem 5 (Trans. Preserves Typing). If $\Omega \vdash a :^{\ell} A$, then $\overline{\Omega} \vdash \overline{a} :^{\ell} \overline{A}$.

Next, assuming a standard call-by-name small-step semantics for both the
languages, we can provide a bisimulation.

Theorem 6 (Forward Simulation). If $a \leadsto a'$ in SDC, then $\overline{a} \leadsto \overline{a'}$ in
DDC$^{\top}$.

Theorem 7 (Backward Simulation). For any term $a$ in SDC, if $\overline{a} \leadsto b$ in
DDC$^{\top}$, then there exists $a'$ in SDC such that $b = \overline{a'}$ and $a \leadsto a'$.


Hence, SDC can be embedded into DDC$^{\top}$, preserving meaning. As such,
DDC$^{\top}$ can analyze dependencies in general.
4.4 Run-time Irrelevance


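Next, we show how to track run-time irrelevance using DDC$^{\top}$. We use the
two-element lattice $\{\bot, \top\}$ with $\bot < \top$ such that $\bot$ and $\top$ correspond to run-time
relevant and run-time irrelevant terms respectively. So, we need to erase terms
marked with $\top$. However, we first define a general indexed erasure function, $\lfloor \cdot \rfloor_{\ell}$,
on DDC$^{\top}$ terms, that erases everything an $\ell$-user should not be able to see. The
function is defined by straightforward recursion in most cases. For example,
$\lfloor x \rfloor_{\ell} := x$ and $\lfloor \Pi x{:}^{\ell_0} A.B \rfloor_{\ell} := \Pi x{:}^{\ell_0} \lfloor A \rfloor_{\ell}.\lfloor B \rfloor_{\ell}$ and $\lfloor \lambda^{\ell_0} x.b \rfloor_{\ell} := \lambda^{\ell_0} x.\lfloor b \rfloor_{\ell}$.
The interesting cases are:

\[
\lfloor b\ a^{\ell_0} \rfloor_{\ell} := \begin{cases} \lfloor b \rfloor_{\ell}\ \lfloor a \rfloor_{\ell}^{\ell_0} & \text{if } \ell_0 \le \ell \\ \lfloor b \rfloor_{\ell}\ \mathbf{unit}^{\ell_0} & \text{otherwise,} \end{cases}
\]
\[
\lfloor (a^{\ell_0}, b) \rfloor_{\ell} := \begin{cases} (\lfloor a \rfloor_{\ell}^{\ell_0}, \lfloor b \rfloor_{\ell}) & \text{if } \ell_0 \le \ell \\ (\mathbf{unit}^{\ell_0}, \lfloor b \rfloor_{\ell}) & \text{otherwise.} \end{cases}
\]

They are so defined because if $\neg(\ell_0 \le \ell)$, an $\ell$-user should not be able to see $a$,
so we replace it with unit.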
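As an informal illustration, the erasure function can be sketched in Python on a small term language. The sketch makes several assumptions that are not part of the calculus: the class names are hypothetical, the lattice is a chain encoded as integers (`BOT < TOP`), type annotations and $\Pi$-types are left out, and only the cases discussed above are covered.

```python
from dataclasses import dataclass

# Two-point lattice encoded as integers (an assumption of this sketch).
BOT, TOP = 0, 1

def leq(l1, l2):
    """Lattice order on the integer-encoded chain."""
    return l1 <= l2

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Unit:
    pass

@dataclass(frozen=True)
class Lam:
    level: int   # grade on the bound variable
    var: str
    body: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object
    level: int   # grade on the argument

@dataclass(frozen=True)
class Pair:
    fst: object
    level: int   # grade on the first component
    snd: object

def erase(t, l):
    """Replace every argument or first component an l-observer cannot see with unit."""
    if isinstance(t, (Var, Unit)):
        return t
    if isinstance(t, Lam):
        return Lam(t.level, t.var, erase(t.body, l))
    if isinstance(t, App):
        arg = erase(t.arg, l) if leq(t.level, l) else Unit()
        return App(erase(t.fn, l), arg, t.level)
    if isinstance(t, Pair):
        fst = erase(t.fst, l) if leq(t.level, l) else Unit()
        return Pair(fst, t.level, erase(t.snd, l))
    raise TypeError(f"unknown term: {t!r}")
```

Erasing `App(Var("f"), Var("secret"), TOP)` at `BOT` replaces the argument with `Unit()`, while erasing at `TOP` leaves the term unchanged. Erasing twice at the same level gives the same result as erasing once, which matches the reading of erased terms as canonical representatives.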

This erasure function is closely related to the indistinguishability relation
we saw in Section 3.3, extended to a dependent setting. (This definition
appears in the extended version of this paper [12].) The erasure function maps the
equivalence classes formed by the indistinguishability relation to their respective
canonical elements. We have verified the following lemmas using the Coq
proof assistant. Footnotes mark the file and lemma name of the corresponding
mechanized results.


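Lemma 8 (Canonical Element¹²). If $\Phi \vdash a_1 \sim_{\ell} a_2$, then $\lfloor a_1 \rfloor_{\ell} = \lfloor a_2 \rfloor_{\ell}$.

Further, a well-graded term and its erasure are indistinguishable.


Lemma 9 (Erasure Indistinguishability¹³). If $\Phi \vdash a : \ell$, then $\Phi \vdash a \sim_{\ell} \lfloor a \rfloor_{\ell}$.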


Next, we can show that erased terms simulate the reduction behavior of their 
unerased counterparts. 


Lemma 10 (Erasure Simulation¹⁴). If $\Phi \vdash a : \ell$ and $a \leadsto b$, then $\lfloor a \rfloor_{\ell} \leadsto \lfloor b \rfloor_{\ell}$.
Otherwise, if $a$ is a value, then so is $\lfloor a \rfloor_{\ell}$.


This lemma follows from Lemma 9 and the non-interference theorem (Theorem 1).
Therefore, it is safe to erase, before run time, all sub-terms marked with
$\top$.

This shows that we can correctly analyze run-time irrelevance using DDC$^{\top}$.
However, supporting compile-time irrelevance requires some changes to the system.
We take them up in the next section.


12 erasure.v:Canonical_element

13 erasure.v:Erasure_Indistinguishability

14 erasure.v:Step_erasure, Value_erasure
5 DDC: Run-time and Compile-time Irrelevance


5.1 Towards Compile-time Irrelevance 


Recall that terms which may be safely ignored while checking for type equality
are said to be compile-time irrelevant. In DDC$^{\top}$, the conversion rule DCT-Conv
checks for type equality at $\top$.


\[
\textsc{DCT-Conv}\quad
\dfrac{\Omega \vdash a :^{\ell} A \qquad |\Omega| \vdash A \equiv_{\top} B \qquad \Omega \vdash B :^{\top} s}{\Omega \vdash a :^{\ell} B}
\]


The equality judgment used in this rule, $\vdash a \equiv_{\top} b$, is an instantiation of the
general judgment $\vdash a \equiv_{\ell} b$, which is the closure of the indistinguishability
relation at $\ell$ under $\beta$-equivalence. When $\ell$ is $\top$, indistinguishability is just identity.
As such, the equality relation at $\top$ degenerates to standard $\beta$-equivalence. So,
rule DCT-Conv does not ignore any part of the terms when checking for type
equality.

To support compile-time irrelevance then, we need the conversion rule to
use equality at some grade strictly less than $\top$ so that $\top$-marked terms may be
ignored. For the irrelevance lattice $\mathcal{L}_I$, the level $C$ can be used for this purpose.
For any other lattice $\mathcal{L}$, we can add two new elements, $C$ and $\top$, above every
other existing element, such that $\ell < C < \top$, and thereafter use level $C$ for this
purpose. So, for any lattice, we can support compile-time irrelevance by equating
types at $C$.
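The lattice extension described above can be sketched as a wrapper around an existing ordering. This is a minimal illustration, assuming the base lattice is given by a comparison function and that its elements are never the strings `"C"` or `"TOP"` (both names are placeholders chosen for this sketch).

```python
C, TOP = "C", "TOP"  # hypothetical names for the two added elements

def extend(base_leq):
    """Return an order that places C and TOP above every base element."""
    def leq(a, b):
        if b == TOP:
            return True          # everything is below TOP
        if a == TOP:
            return False         # TOP is below nothing else
        if b == C:
            return True          # every base element and C itself is below C
        if a == C:
            return False         # C is not below any base element
        return base_leq(a, b)    # both are base elements
    return leq

def subset_leq(a, b):
    """Example base lattice: sets of principals ordered by inclusion."""
    return a <= b

leq = extend(subset_leq)
```

With `alice = frozenset({"alice"})`, the extended order keeps the base comparisons intact and additionally satisfies `leq(alice, C)` and `leq(C, TOP)`, while `leq(TOP, C)` is false.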

Referring back to the examples in Section 2.2, note that for phantom :
Nat$^{\top}$ -> Type, we have phantom 0$^{\top}$ $\equiv_{C}$ phantom 1$^{\top}$. With this equality, we can
type-check idp : phantom 0$^{\top}$ -> phantom 1$^{\top}$ = $\lambda$ x. x, even without knowing
the definition of phantom.

Now, observe that in rule DCT-Conv, the new type $B$ is also checked at
$\top$. If we want to check for type equality at $C$, we need to make sure that the
types themselves are checked at $C$. However, checking types at $C$ would rule
out variables marked at $\top$ from appearing in them. This would restrict us from
expressing many examples, including the polymorphic identity function.

To move out of this impasse, we take inspiration from EPTS [20,21]. The
key idea, adapted from [20], is to use a judgment of the form $C \wedge \Omega \vdash a :^{C} A$
instead of a judgment of the form $\Omega \vdash a :^{\top} A$. The operation $C \wedge \Omega$ takes the
point-wise meet of the labels in the context $\Omega$ with $C$, essentially reducing any
label marked as $\top$ to $C$, making it available for use in a $C$-expression. This
operation, called truncation, makes $\top$-marked variables available at $C$. Other
systems also use similar mechanisms for tracking irrelevance; for example, we
can see a relation between this idea and analogous ones in [27] and [3]. In these
systems, a "context resurrection" operation makes proof variables and irrelevant
variables in the context available for use, similar to how $C \wedge \Omega$ makes $\top$-marked
variables in the context available for use.
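Truncation is easy to picture on a concrete context. The following Python sketch assumes the three-point chain lattice encoded as integers and represents a context as a list of `(name, label, type)` triples; the helper `can_use` is a hypothetical stand-in for the variable-rule premise that the label on a variable must be at most the observer level.

```python
# Three-point chain lattice encoded as integers (an assumption of this sketch).
BOT, C, TOP = 0, 1, 2

def meet(l1, l2):
    """Greatest lower bound on a chain."""
    return min(l1, l2)

def truncate(level, ctx):
    """Point-wise meet of every label in the context with the given level."""
    return [(x, meet(level, label), ty) for (x, label, ty) in ctx]

def can_use(ctx, x, observer):
    """Variable-rule side condition: the label on x is at most the observer level."""
    for (y, label, _ty) in reversed(ctx):
        if y == x:
            return label <= observer
    return False

# x is marked TOP, so it is hidden from a C-observer until the context is truncated.
ctx = [("x", TOP, "Type"), ("y", BOT, "x")]
```

Here `can_use(ctx, "x", C)` is false, but `can_use(truncate(C, ctx), "x", C)` is true, because truncation lowers the label on `x` from `TOP` to `C` and leaves the label on `y` untouched.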
5.2 DDC: Basics


Next, we design a general dependency analyzing calculus, DDC, that takes
advantage of compile-time irrelevance in its type system. DDC is a generalization
of DDC$^{\top}$ and EPTS$^{\bullet}$ [20]. When $C$ equals $\top$, DDC degenerates to DDC$^{\top}$,
which does not use compile-time irrelevance. When $C$ equals $\bot$, DDC degenerates
to EPTS$^{\bullet}$, which identifies compile-time and run-time irrelevance. A crucial
distinction between EPTS$^{\bullet}$ and DDC is that while the former is tied to a
two-element lattice, the latter can use any lattice. Thus, not only can DDC distinguish
between run-time and compile-time irrelevance, but it can also simultaneously
track other dependencies.


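$\Omega \vdash a :^{\ell} A$ (DDC core typing rules)

\[
\textsc{T-Var}\quad
\dfrac{\ell_0 \le \ell \qquad x{:}^{\ell_0} A \in \Omega \qquad \ell \le C}{\Omega \vdash x :^{\ell} A}
\]
\[
\textsc{T-Type}\quad
\dfrac{\ell \le C \qquad \mathcal{A}(s_1, s_2)}{\Omega \vdash s_1 :^{\ell} s_2}
\]
\[
\textsc{T-Pi}\quad
\dfrac{\Omega \vdash A :^{\ell} s_1 \qquad \Omega, x{:}^{\ell} A \vdash B :^{\ell} s_2 \qquad \mathcal{R}(s_1, s_2, s_3)}{\Omega \vdash \Pi x{:}^{\ell_0} A.B :^{\ell} s_3}
\]
\[
\textsc{T-AbsC}\quad
\dfrac{\Omega, x{:}^{\ell_0 \vee \ell} A \vdash b :^{\ell} B \qquad \Omega \Vdash (\Pi x{:}^{\ell_0} A.B) :^{\top} s}{\Omega \vdash \lambda x{:}^{\ell_0} A.b :^{\ell} \Pi x{:}^{\ell_0} A.B}
\]
\[
\textsc{T-AppC}\quad
\dfrac{\Omega \vdash b :^{\ell} \Pi x{:}^{\ell_0} A.B \qquad \Omega \Vdash a :^{\ell_0 \vee \ell} A}{\Omega \vdash b\ a^{\ell_0} :^{\ell} B\{a/x\}}
\]
\[
\textsc{T-ConvC}\quad
\dfrac{\Omega \vdash a :^{\ell} A \qquad |C \wedge \Omega| \vdash A \equiv_{C} B \qquad \Omega \Vdash B :^{\top} s}{\Omega \vdash a :^{\ell} B}
\]

$\Omega \Vdash a :^{\ell} A$ (Truncate at $\top$)

\[
\textsc{CT-Leq}\quad
\dfrac{\Omega \vdash a :^{\ell} A \qquad \ell \le C}{\Omega \Vdash a :^{\ell} A}
\qquad
\textsc{CT-Top}\quad
\dfrac{C \wedge \Omega \vdash a :^{C} A \qquad C < \ell}{\Omega \Vdash a :^{\ell} A}
\]


Fig. 5. Dependent type system with compile-time irrelevance (core rules)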


The core typing rules of DDC appear in Figure 5. Compared to DDC$^{\top}$,
this type system maintains the invariant that for any $\Omega \vdash a :^{\ell} A$, we have
$\ell \le C$. To ensure that this is the case, rule T-Type and rule T-Var include
this precondition. This restriction means that we cannot really derive any term
at $\top$ in DDC. We can get around this restriction by deriving $C \wedge \Omega \vdash a :^{C} A$ in
place of $\Omega \vdash a :^{\top} A$.

Wherever DDC$^{\top}$ uses $\top$ as the observer level on a typing judgment, DDC
uses truncation and level $C$ instead. If DDC$^{\top}$ uses some grade other than $\top$
as the observer level, DDC leaves the derivation as such. So a DDC$^{\top}$ judgment
$\Omega \vdash a :^{\ell} A$ is replaced with a truncated-at-top judgment, $\Omega \Vdash a :^{\ell} A$, which can
be read as: if $\ell = \top$, use the truncated version $C \wedge \Omega \vdash a :^{C} A$; otherwise use
the normal version $\Omega \vdash a :^{\ell} A$, as we see in Figure 5. In the typing rules, uses of
this new judgment have been highlighted in gray to emphasize the modification
with respect to DDC$^{\top}$.


5.3 $\Pi$-types


Rule T-Pi is unchanged. The lambda rule T-AbsC now checks the type at $C$
after truncating the variables in the context to $C$. The application rule T-AppC
checks the argument using the truncated-at-top judgment. Note that if $\ell_0 = \top$,
the term $a$ can depend upon any variable in $\Omega$. Such a dependence is allowed
since information can always flow from relevant to irrelevant contexts.

To see how irrelevance works in this system, let's consider the definition and
use of the polymorphic identity function.


id : $\Pi$x:$^{\top}$Type. x -> x
id = $\lambda^{\top}$x. $\lambda$y. y


In DDC$^{\top}$, the type $\Pi$x:$^{\top}$Type. x -> x is checked at $\top$. However, here it
must be checked at level $C$, which requires the premise x:$^{C}$Type $\vdash$ x -> x :$^{C}$
Type. Note that if we used the same grade for the bound variable x in rule T-Pi
and rule T-AbsC, we would have been in trouble because variable x is
compile-time relevant while we check the type, even though it is irrelevant in the term.¹⁵

Finally, observe that rule T-ConvC uses the definitional equality at $C$
instead of $\top$ and that the new type is checked after truncation.


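5.4 $\Sigma$-types


\[
\textsc{T-WPairC}\quad
\dfrac{\Omega \Vdash a :^{\ell_0 \vee \ell} A \qquad \Omega \vdash b :^{\ell} B\{a/x\} \qquad \Omega \Vdash \Sigma x{:}^{\ell_0} A.B :^{\top} s}{\Omega \vdash (a^{\ell_0}, b) :^{\ell} \Sigma x{:}^{\ell_0} A.B}
\]
\[
\textsc{T-LetPairC}\quad
\dfrac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \qquad \Omega, x{:}^{\ell_0 \vee \ell} A, y{:}^{\ell} B \vdash c :^{\ell} C\{(x^{\ell_0}, y)/z\} \qquad \Omega, z{:}^{\top} (\Sigma x{:}^{\ell_0} A.B) \Vdash C :^{\top} s}{\Omega \vdash \mathbf{let}\ (x^{\ell_0}, y) = a\ \mathbf{in}\ c :^{\ell} C\{a/z\}}
\]


We also need to modify the typing rules for $\Sigma$-types accordingly. In particular,
when we create a pair, we check the first component using the truncated-at-top
judgment. This is akin to how we check the argument in rule T-AppC. Note that
if $\ell_0 = \top$, the first component $a$ is compile-time irrelevant. In such a situation,
we cannot type-check the second projection since it requires the first projection,
as we see in the derived¹⁶ projection rules below. So pairs having type $\Sigma x{:}^{\top} A.B$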


15 This is why we fuse the graded modality with the dependent types. If they were 
separated, and we had to bind here, it would be a problem since a dependent function 
and its type have different restrictions vis-a-vis the bound variable. 

16 strong_exists.v:T_wproj1, T_wproj2
can only be eliminated via pattern matching if $B$ mentions $x$. However, pairs
having type $\Sigma x{:}^{C} A.B$ can be eliminated via projections.

For example, for an output of the filter function, ys : $\Sigma$m:$^{C}$Nat. Vec m Bool,
we have $\pi_1$ ys :$^{C}$ Nat and $\pi_2$ ys : Vec ($\pi_1$ ys) Bool. Note that ($\pi_1$ ys) is visible
at $C$ and is used in the type of ($\pi_2$ ys). We can substitute ($\pi_1$ ys) for m in (Vec m
Bool) because m:$^{C}$Nat $\vdash$ Vec m Bool :$^{C}$ Type. However, ($\pi_1$ ys) cannot be used
at $\bot$, so it will be erasable then.


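\[
\textsc{T-Proj1C}\quad
\dfrac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \qquad \ell_0 \le \ell}{\Omega \vdash \pi_1^{\ell_0}\, a :^{\ell} A}
\]
\[
\textsc{T-Proj2C}\quad
\dfrac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \qquad \ell_0 \le C}{\Omega \vdash \pi_2^{\ell_0}\, a :^{\ell} B\{\pi_1^{\ell_0}\, a/x\}}
\]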


5.5 Non-interference 


DDC satisfies a non-interference theorem analogous to the one presented for
SDC, using suitable definitions for the grading relation, written $\Phi \vdash a : \ell$, and
indexed indistinguishability, written $\Phi \vdash b_1 \sim_{\ell} b_2$. The complete definition of these
judgements appears in the extended version of this paper [12].


Lemma 11 (Typing implies grading¹⁷). If $\Omega \vdash a :^{\ell} A$ then $|\Omega| \vdash a : \ell$.


Lemma 12 (Equivalence¹⁸). Indexed indistinguishability at $\ell$ is an equivalence
relation on well-graded terms at $\ell$.


Lemma 13 (Indistinguishability under substitution¹⁹). If $\Phi, x{:}\ell \vdash b_1 \sim_{k} b_2$
and $\Phi \vdash a_1 \sim^{\ell}_{k} a_2$ then $\Phi \vdash b_1\{a_1/x\} \sim_{k} b_2\{a_2/x\}$.


Theorem 8 (Non-interference for DDC²⁰). If $\Phi \vdash a_1 \sim_{k} a_1'$ and $a_1 \leadsto a_2$
then there exists some $a_2'$ such that $a_1' \leadsto a_2'$ and $\Phi \vdash a_2 \sim_{k} a_2'$.


5.6 Consistency of Equality 


The equality relation of DDC incorporates compile-time irrelevance. To show 
that the type system is sound, we need to show that the equality relation is 
consistent. Consistency of definitional equality means that there is no derivation 
that equates two types having different head forms. For example, it should not 
equate Nat with Unit. 

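Note that if $\top$ inputs can interfere with $C$ outputs, the equality relation
cannot be consistent. To see why, let $x{:}^{\top} A \vdash b :^{C} \mathbf{Bool}$ and for $a_1, a_2 : A$, let
the terms $b\{a_1/x\}$ and $b\{a_2/x\}$ reduce to True and False respectively. Now,
$(\lambda^{\top} x.\ \mathbf{if}\ b\ \mathbf{then}\ \mathbf{Nat}\ \mathbf{else}\ \mathbf{Unit})\ a_1^{\top} \equiv_{C} (\lambda^{\top} x.\ \mathbf{if}\ b\ \mathbf{then}\ \mathbf{Nat}\ \mathbf{else}\ \mathbf{Unit})\ a_2^{\top}$. But
then, by $\beta$-equivalence, $\mathbf{Nat} \equiv_{C} \mathbf{Unit}$.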

To prove consistency, we construct a standard parallel reduction relation 
and show that this relation is confluent. Thereafter, we prove that if two terms 


17 typing.v:Typing_Grade
18 geq.v:GEq_refl, GEq_symmetry, GEq_trans
19 subst.v:CEq_GEq_equality_substitution
20 geq.v:CEq_GEq_respects_Step
are definitionally equal at $\ell$, then they are joinable at $\ell$, meaning they reduce,
through parallel reduction, to two terms that are indistinguishable at $\ell$. Next,
we show that joinability at $\ell$ implies consistency. Therefore, we conclude that
for any $\ell$, the equality relation at $\ell$ is consistent. This implies that the equality
relation at $C$, which ignores sub-terms marked with $\top$, is sound. Hence, DDC
tracks compile-time irrelevance correctly. Note that DDC can track run-time
irrelevance the same way as DDC$^{\top}$.

We formally state consistency in terms of head forms, i.e. syntactic forms
that correspond to types such as sorts $s$, Unit, $\Pi x{:}^{\ell} A.B$, etc.


Theorem 9 (Consistency²¹). If $\vdash a \equiv_{\ell} b$, and $a$ and $b$ both are head forms,
then they have the same head form.


5.7 Soundness theorem 


DDC is type sound and we have checked this and other results using the Coq 
proof assistant. Below, we give an overview of the important lemmas in this 
development. 

The properties below are stated for DDC, but they also apply to DDC$^{\top}$
since DDC degenerates to DDC$^{\top}$ whenever $C = \top$. First, we list the properties
related to grading that hold for all judgments: indexed indistinguishability,
definitional equality, and typing. (We only state the lemmas for typing; their
counterparts are analogous.) These lemmas are similar to their simply-typed
counterparts in Section 3.2.


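Lemma 14 (Narrowing²²). If $\Omega \vdash a :^{\ell} A$ and $\Omega' \le \Omega$, then $\Omega' \vdash a :^{\ell} A$.

Lemma 15 (Weakening²³). If $\Omega_1, \Omega_2 \vdash a :^{\ell} A$ then $\Omega_1, \Omega, \Omega_2 \vdash a :^{\ell} A$.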


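Lemma 16 (Restricted upgrading²⁴). If $\Omega_1, x{:}^{\ell_0} A, \Omega_2 \vdash b :^{\ell} B$ and $\ell_1 \le \ell$
then $\Omega_1, x{:}^{\ell_0 \vee \ell_1} A, \Omega_2 \vdash b :^{\ell} B$.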


Next, we list some properties that are specific to the typing judgment. For
any typing judgment in DDC, the observer grade $\ell$ is at most $C$. Further, the
observer grade of any judgment can be raised up to $C$.


Lemma 17 (Bounded by C²⁵). If $\Omega \vdash a :^{\ell} A$ then $\ell \le C$.


Lemma 18 (Subsumption²⁶). If $\Omega \vdash a :^{\ell} A$ and $\ell \le k$ and $k \le C$ then
$\Omega \vdash a :^{k} A$.


Note that we don't require contexts to be well-formed in the typing judgment;
we add context well-formedness constraints, as required, to our lemmas. The
following lemmas are true for well-formed contexts. A context $\Omega$ is well-formed,
expressed as $\vdash \Omega$, iff for any assumption $x{:}^{\ell} A$ in $\Omega$, we have $\Omega' \Vdash A :^{\top} s$, where
$\Omega'$ is the prefix of $\Omega$ that appears before the assumption.


21 consist.v:DefEq_Consistent
22 narrowing.v:Typing_narrowing
23 weakening.v:Typing_weakening
24 pumping.v:Typing_pumping
25 pumping.v:Typing_leq_C
26 typing.v:Typing_subsumption
Lemma 19 (Substitution²⁷). If $\Omega_1, x{:}^{\ell_0} A, \Omega_2 \vdash b :^{\ell} B$ and $\vdash \Omega_1$ and
$\Omega_1 \Vdash a :^{\ell_0} A$ then $\Omega_1, \Omega_2\{a/x\} \vdash b\{a/x\} :^{\ell} B\{a/x\}$.


Next, if a term is well-typed in our system, the type itself is also well-typed.

Lemma 20 (Regularity²⁸). If $\Omega \vdash a :^{\ell} A$ and $\vdash \Omega$ then $\Omega \Vdash A :^{\top} s$.


Finally, we have the two main lemmas proving type soundness.


Lemma 21 (Preservation²⁹). If $\Omega \vdash a :^{\ell} A$ and $\vdash \Omega$ and $a \leadsto a'$, then
$\Omega \vdash a' :^{\ell} A$.


Lemma 22 (Progress³⁰). If $\emptyset \vdash a :^{\ell} A$ then either $a$ is a value or there exists
some $a'$ such that $a \leadsto a'$.


Hence, DDC is type sound. We have seen earlier that it tracks run-time and 
compile-time irrelevance correctly. 

DDC is parameterized by a generic pure type system and a generic lattice. 
When the parameterizing pure type system is strongly normalizing, such as the 
Calculus of Constructions, type-checking is decidable. In the next section, we 
provide a demonstration. 


6 Type Checking 


Because DDC is parameterized by a pure type system, not all instances of DDC
admit decidable type checking. For example, in the presence of the type:type axiom,
the system includes non-terminating computations via Girard's paradox. As a result,
we cannot decide equality in that system, so type checking will be undecidable.
However, if the sorts, axioms and rules are chosen such that the language is
strongly normalizing, then we can define a decidable type checking algorithm. This
algorithm is standard, but relies on a decision procedure for the equality judgement.

Our consistency proof, described in Section 5.6, gives us a start. This proof 
uses an auxiliary binary relation called joinability, which holds when two terms 
can use multiple steps of parallel reduction to reach two simpler terms that 
differ only in their unobservable components. Joinability and definitional equality 
induce the same relation on DDC terms. We can show that two DDC terms 
are definitionally equal if and only if they are joinable³¹, which means that a
decision procedure based on joinability will be sound and complete for DDC’s 
labeled definition of equivalence. 

Therefore, the decidability of type checking reduces to showing strong nor- 
malization. If we select the sorts, axioms and rules of DDC to match those of 
the Calculus of Constructions [5], we believe that this result holds, but leave a 
direct proof for future work. However, by translating this instance of DDC to 
ICC*, we can show that a sublanguage of this instance is strongly normalizing. 


27 typing.v:Typing_substitution_CTyping
28 typing.v:Typing_regularity
29 typing.v:Typing_preservation
30 progress.v:Typing_progress
31 consist.v:DefEq_Joins, Joins_DefEq
ICC* [6] is a version of the Implicit Calculus of Constructions with annotations
that support decidable type checking. Because it includes only (relevant
and irrelevant) $\Pi$-types, we must restrict our attention to the corresponding
fragment of DDC.

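We define the following translation, written $\overline{\,\cdot\,}$, that converts DDC terms to
ICC* terms. The key parts of this translation map arguments labeled $C$ and
below to relevant arguments, and those labeled greater than $C$, such as $\top$, to
irrelevant arguments.³²


\[
\overline{x} = x \qquad \overline{s} = s
\]
\[
\overline{\Pi x{:}^{\ell} A.B} = \begin{cases} \Pi(x : \overline{A}).\overline{B} & \text{if } \ell \le C \\ \Pi[x : \overline{A}].\overline{B} & \text{otherwise} \end{cases}
\]
\[
\overline{\lambda x{:}^{\ell} A.b} = \begin{cases} \lambda(x : \overline{A}).\overline{b} & \text{if } \ell \le C \\ \lambda[x : \overline{A}].\overline{b} & \text{otherwise} \end{cases}
\qquad
\overline{b\ a^{\ell}} = \begin{cases} \overline{b}\ (\overline{a}) & \text{if } \ell \le C \\ \overline{b}\ [\overline{a}] & \text{otherwise} \end{cases}
\]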
Note that ICC* compares terms for equality after an erasure operation, written
$\cdot^{*}$, that removes all irrelevant arguments. Now, we can show that the above
translation preserves definitional equality and typing. Here, $\overline{\Omega}$ denotes $\Omega$ with
the labels at the variable bindings omitted.
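The shape of this translation can be sketched in Python as a pretty-printer from labeled terms to ICC*-style syntax. The sketch is illustrative only: the class names are hypothetical, the lattice is the three-point chain encoded as integers, variables and sorts are plain strings, and the output uses ASCII `Pi` and `lam` in place of the Greek binders.

```python
from dataclasses import dataclass

# Three-point chain lattice encoded as integers (an assumption of this sketch).
BOT, C, TOP = 0, 1, 2

@dataclass(frozen=True)
class Pi:
    var: str
    level: int
    ty: object
    body: object

@dataclass(frozen=True)
class Lam:
    var: str
    level: int
    ty: object
    body: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object
    level: int

def brackets(level):
    """Parentheses for relevant positions (level <= C), square brackets otherwise."""
    return ("(", ")") if level <= C else ("[", "]")

def to_icc(t):
    """Render a labeled term in ICC*-style syntax, dropping the labels."""
    if isinstance(t, str):  # variables and sorts
        return t
    if isinstance(t, (Pi, Lam)):
        o, c = brackets(t.level)
        head = "Pi" if isinstance(t, Pi) else "lam"
        return f"{head}{o}{t.var}:{to_icc(t.ty)}{c}.{to_icc(t.body)}"
    if isinstance(t, App):
        o, c = brackets(t.level)
        return f"{to_icc(t.fn)} {o}{to_icc(t.arg)}{c}"
    raise TypeError(f"unknown term: {t!r}")
```

Applied to the polymorphic identity, `to_icc(Lam("x", TOP, "Type", Lam("y", BOT, "x", "y")))` returns `"lam[x:Type].lam(y:x).y"`: the type argument labeled `TOP` becomes an irrelevant (bracketed) binder, while the value argument stays relevant.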


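Lemma 23 (Translation preservation). If $\Phi \vdash A \equiv_{C} B$, then $\overline{A}^{*} \cong_{\beta\eta} \overline{B}^{*}$.
If $\Omega \vdash a :^{\ell} A$, then $\overline{\Omega} \vdash \overline{a} : \overline{A}$.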


Next, observe that because $\beta$-reductions are preserved by the translation, any
parallel reduction in DDC between terms $a$ and $b$ at level $C$, where $a \neq b$, would
correspond to a non-empty sequence of reduction steps from $\overline{a}$ to $\overline{b}$ in ICC*. That
means that an infinite sequence of parallel reductions $a_0, a_1, \ldots$, where each term
differs from the previous, corresponds to an infinite sequence of reductions
$\overline{a_0}, \overline{a_1}, \ldots$ in ICC*. Therefore, as all well-typed ICC* terms are strongly
normalizing, we can conclude that this is so for this instance of DDC.


Non-terminating instances of DDC. For pure type systems that are not strongly 
normalizing, such as the type:type language, there is an alternative approach to 
developing a calculus with decidable type checking, following Weirich et al. [35]. 
The key idea is to develop an annotated version of DDC that book-keeps addi- 
tional information from typing and equality derivations. In such an annotated 
version, the conversion rule would include an explicit coercion annotation that 
witnesses the equality between the concerned types, thus avoiding the need for 
normalization. 


32 The syntax of ICC* uses parentheses to indicate usual (relevant) arguments and 
square brackets to indicate arguments that are irrelevant at both run time and 
compile time. 
7 Discussions and Related Work


7.1 Irrelevance in Dependent Type Theories 


Overall, compile-time and run-time irrelevance is a well-studied topic in the 
design of dependent type systems. In some systems, the focus is only on support 
for run-time irrelevance: see [18,4,8,19,20,32]. In other systems, the focus is on 
compile-time irrelevance: see [27,3]. Some systems support both, but require 
them to overlap, such as [6,21,35,24]. The system of Moon et al. [23] does not
require them to overlap but their type system does not make use of compile-time 
irrelevance in the conversion rule. 

To compare, system DDC$^{\top}$, presented here, can support run-time irrelevance
only and is similar to the core language of Tejiščák [32]. However, note that
DDC$^{\top}$ can track dependencies in general while the system in [32] tracks
run-time irrelevance alone. DDC, on the other hand, is the only system that we are
aware of that tracks run-time and compile-time irrelevance separately and makes
use of the latter in the conversion rule. Further, DDC tracks these irrelevances in
the presence of strong $\Sigma$-types with erasable first components, something which,
to the best of our knowledge, no prior work has been able to do.

Prior work has identified the difficulty in handling strong $\Sigma$-types with
erasable first components in a setting that tracks compile-time irrelevance. Abel
and Scherer [3] point out that strong irrelevant $\Sigma$-types make their theory
inconsistent. Similarly, EPTS$^{\bullet}$ [21] cannot define the projections for pairs having
such $\Sigma$-types. The reason behind this is that EPTS$^{\bullet}$ is hard-wired to work with
a two-element lattice which identifies compile-time and run-time irrelevance. As
such, projections from such pairs lead to type unsoundness. For example,
considering the first components to be run-time irrelevant, the pairs (Int, unit)
and (Bool, unit) are run-time equivalent. Since EPTS$^{\bullet}$ identifies run-time and
compile-time irrelevance, these pairs are also compile-time equivalent. Then,
taking the first projections of these pairs, one ends up with Int and Bool being
compile-time equivalent. We resolve this problem by distinguishing between
run-time and compile-time irrelevance, thus requiring a lattice with three elements.

Next, we compare our work with existing literature with respect to the equality
relation. We analyze compile-time irrelevance to enable the equality relation
to ignore unnecessary sub-terms. However, since our equality relation is untyped,
we cannot include type-dependent rules in our system, such as $\eta$-equivalence for
the Unit type. Several prior works on irrelevance [19,6,21,32] use an untyped
equality relation. However, some prior works, such as [27,3], do consider
compile-time irrelevance in the context of type-directed equality. But such systems
require irrelevant arguments to functions to appear only irrelevantly in the codomain
type of the function, thus ruling out several examples including the polymorphic
identity function.


7.2 Quantitative Type Systems 


Our work is closely related to quantitative type systems [26,15,9,18,4,25,2,10,23]. 
Such systems provide a fine-grained accounting of coeffects, viewed as resources, 
for example, variable usage, linearity, liveness, etc. A typical judgment from a
quantitative type system [10] may look like:


$x{:}^{1}\,\mathbf{Bool},\ y{:}^{1}\,\mathbf{Int},\ z{:}^{0}\,\mathbf{Bool} \vdash \mathbf{if}\ x\ \mathbf{then}\ y + 1\ \mathbf{else}\ y - 1 :^{1} \mathbf{Int}$


The variable x is used once in the condition, the variable y is used once in each 
of the branches while the variable z is not used at all. As such, they are marked 
with these quantities in the context. 

This form of judgment is very similar to our typing judgments with quan- 
tities appearing in place of levels. However, there is a crucial difference: to the 
right of the turnstile, while any level may appear in our judgments, only the 
quantity 1 can appear in typing judgments of quantitative systems. A quanti- 
tative system that allows an arbitrary quantity to the right of the turnstile is 
not closed under substitution [18,4]. As such, quantitative systems are tied to 
a fixed reference while our systems can view programs from different reference 
levels. This difference in form results from the difference in the purposes the two 
kinds of systems serve: quantitative systems count while our systems compare. 
Counting requires a fixed standard or reference whereas comparison does not. 
Applications that require counting, like linearity tracking, are handled well by 
quantitative systems while applications that require comparison, like ensuring 
secure information flow, are handled well by systems of our kind. 

From a type-theoretic standpoint, in general, quantitative systems cannot 
eliminate pairs through projections. This is so because there is no general way 
to split the resources of the context that type-checks a pair. Eliminating pairs 
through projections is straightforward in our systems because the grade on the 
typing judgment can control where the projections are visible. 


7.3 Dependency Analysis and Dependent Type Theory 


Dependency analysis and dependent type theories have come together in some 
existing work. 

Like our system, Prost [28] extends the $\lambda$-cube so that it may track
dependencies. However, unlike our system, this work uses sorts to track dependencies.
It is inspired by the distinction between sorts in the Calculus of Constructions 
where computationally relevant and irrelevant terms live in sorts Set and Prop 
respectively. As Mishra-Linger [21] points out, such an approach ties up two 
distinct language features, sorts and dependency analysis, which can be treated 
in a more orthogonal manner. 

Bernardy and Moulin's type-theory in color [7] is closely related to our work.
This type-theory uses colors to erase terms while we use grades. Colors and 
grades both form a lattice structure and their usage in the respective type sys- 
tems are quite similar. However, in type-theory in color, internalized parametric- 
ity is used to reason about erasure; so it is important that the type-theory be 
logically consistent. Our work does not rely on the normalizing nature of the 
theory; we take a direct route to analyzing erasure. 

Like our work, Lourenço and Caires [17] track information flow in a dependent
type system. But Lourenço and Caires [17] focus on more imperative features,
like modeling of state, while we focus on irrelevance. A distinguishing feature of
their system is that they allow security labels to depend upon terms, something
that we don't attempt here.


8 Conclusion 


We started with the aim of designing a dependent calculus that can analyze 
dependencies in general, and run-time and compile-time irrelevance in partic- 
ular. Towards this end, we designed a simple dependency calculus, SDC, and 
then extended it to two dependent calculi, DDC! and DDC. DDC! can track 
run-time irrelevance while DDC can track both run-time and compile-time irrel- 
evance along with other dependencies. 

In the future, we would like to explore how irrelevance interacts with other
dependencies. We also want to explore whether our systems can be integrated with
existing graded type systems, especially quantitative type systems. Yet another
interesting direction for research is how they compare with graded effect
systems.

Our work lies in the intersection of dependency analysis and irrelevance track- 
ing in dependent type systems. Both these areas have rich literature of their 
own. We hope that the connections established in this paper will be mutually 
beneficial and help in the future exploration of dependencies and irrelevance in 
dependent type systems. 
Abstract. Polarization of types in call-by-push-value naturally leads
to the separation of inductively defined observable values (classified by 
positive types), and coinductively defined computations (classified by 
negative types), with adjoint modalities mediating between them. Taking 
this separation as a starting point, we develop a semantic characterization 
of typing with step indexing to capture observation depth of recursive 
computations. This semantics justifies a rich set of subtyping rules for 
an equirecursive variant of call-by-push-value, including variant and lazy 
records. We further present a bidirectional syntactic typing system for 
both values and computations that elegantly and pragmatically circum- 
vents difficulties of type inference in the presence of width and depth 
subtyping for variant and lazy records. We demonstrate the flexibility of 
our system by systematically deriving related systems of subtyping for 
(a) isorecursive types, (b) call-by-name, and (c) call-by-value, all using a 
structural rather than a nominal interpretation of types. 


Keywords: Call-by-push-value · Semantic Typing · Subtyping


1 Introduction 


Subtyping is an important concept in programming languages because it simul- 
taneously allows more programs to be typed and more precise properties of 
programs to be expressed as types. The interaction of subtyping with parametric 
polymorphism and recursive types is complex and despite a lot of progress and 
research, not yet fully understood. 

In this paper we study the interaction of subtyping with equirecursive types in 
call-by-push-value [53, 54], which separates the language of types into positive and 
negative layers. This polarization elegantly captures that positive types classifying 
observable values are inductive, while negative types classifying (possibly recur- 
sive) computations are coinductive. It lends itself to a particularly simple semantic 
definition of typing using a mixed induction/coinduction [9, 13, 22]. From this 
definition, we can immediately derive a form of semantic subtyping [15, 35, 36]. 
Concretely, we realize the mixed induction/coinduction via step-indexing and 
carry out our metatheory in Brotherston and Simpson’s system $\mathrm{CLKID}^{\omega}$ of 
circular proofs [14]. This includes a novel proof that syntactic versions of typing 
and subtyping are sound with respect to our semantic definitions. While we also 
conjecture that subtyping is precise (in the sense of [55]), we postpone this more 
syntactic property to future work. 

Because our foundation is call-by-push-value, a paradigm that synthesizes call- 
by-name and call-by-value based on the logical principle of polarization, we obtain 
several additional results in relatively straightforward ways. For example, both 
width and depth subtyping for variant and lazy records are naturally included. 
Furthermore, following Levy’s interpretation of call-by-value and call-by-name 
functional languages into call-by-push-value, we extract subtyping relations and 
algorithms for these languages and prove them sound and complete. We also note 
that we can directly interpret the isorecursive types in Levy’s original formulation 
of call-by-push-value [53]. 

We further provide a systematic notion of bidirectional typing that avoids some 
complexities that arise in a structural type system with variant and lazy records. 
The resulting decision procedure for typing is quite precise and suggests clear 
locations for noting failure of typechecking. The combination of equirecursive call- 
by-push-value with bidirectional typing achieves some of the goals of refinement 
types [24, 34], which fit a structural system inside a generative type language. 
Here we have considerably more freedom and less redundancy. However, we do 
not yet treat intersection types or polymorphism. 

We summarize our main contributions: 

1. A simple semantics for types and subtyping in call-by-push-value, interpreting 
positive types inductively and negative types coinductively, realized via step 
indexing (Sections 3 and 4) 

2. A new decidable system of equirecursive subtyping for call-by-push-value 
including width and depth subtyping for variant and lazy records (Section 4) 

3. A novel application of Brotherston and Simpson’s system $\mathrm{CLKID}^{\omega}$ [14] of 
circular proofs to give a particularly elegant and flexible soundness proof for 
subtyping (Section 5) 

4. A system of bidirectional typing that captures a straightforward and precise 
typechecking algorithm (Section 6), whose implementation is provided as a 
publicly available artifact [50] 

5. A simple interpretation of Levy’s original isorecursive types for call-by-push- 
value [53] into our equirecursive setting (Section 7) 

6. Subtyping rules for call-by-name and call-by-value, derived via Levy’s trans- 
lations of such languages into call-by-push-value (Section 8) 


These are followed by a discussion of related work and a conclusion. Additional 
material and proofs are provided in an appendix of the extended paper version [49]. 


2 Equirecursive Call-by-Push-Value 


Call-by-push-value [53, 54] is characterized by a separation of types into positive $\tau^+$ 
and negative $\sigma^-$ layers, with shift modalities going back and forth between them. 
The intuition is that positive types classify observable values $v$ while negative 
types classify computations $e$. 


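$$
\begin{array}{lcl}
\tau^+, \sigma^+ & ::= & \tau_1^+ \otimes \tau_2^+ \mid 1 \mid \oplus\{\ell : \tau_\ell^+\}_{\ell \in L} \mid {\downarrow}\sigma^- \mid t^+ \\
\tau^-, \sigma^- & ::= & \tau^+ \to \sigma^- \mid \&\{\ell : \sigma_\ell^-\}_{\ell \in L} \mid {\uparrow}\tau^+ \mid s^-
\end{array}
$$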


The usual binary product $\tau \times \sigma$ splits into two: $\tau^+ \otimes \sigma^+$ for eager, observable 
products inhabited by pairs of values, and $\&\{\ell : \sigma_\ell^-\}_{\ell \in L}$ for lazy, unobservable 
records with a finite set $L$ of fields we can project out. Binary sums are also gen- 
eralized to variant record types $\oplus\{\ell : \tau_\ell^+\}_{\ell \in L}$.$^4$ These are not just a programming 
convenience but allow for richer subtyping: lazy and variant record types support 
both width and depth subtyping, whereas the usual binary products and sums 
support only the latter. For example, width subtyping means that $\oplus\{\mathsf{false} : 1\}$ 
is a subtype of $\mathsf{bool}^+ = \oplus\{\mathsf{false} : 1, \mathsf{true} : 1\}$, while $1$ would not be a subtype of 
the usual binary $1 + 1$. Neither is $1$ a subtype of $\mathsf{bool}^+$, demonstrating the utility 
of variant record types with one label, such as $\oplus\{\mathsf{false} : 1\}$. Similar examples 
exist for lazy record types. This way, we recover some of the benefits of refinement 
types without the syntactic burden of a distinct refinement layer. 

The shift ${\downarrow}\sigma^-$ is inhabited by an unevaluated computation of type $\sigma^-$ (a 
“thunk”). Conversely, the shift ${\uparrow}\tau^+$ includes a value as a trivial computation (a 
“return”). Levy [53] writes $U\,B$ instead of ${\downarrow}\sigma^-$ and $F\,A$ instead of ${\uparrow}\tau^+$. 

Finally, we model recursive types not by explicit constructors $\mu\alpha^+.\tau^+$ and 
$\nu\alpha^-.\sigma^-$ but by type names $t^+$ and $s^-$ which are defined in a global signature 
$\Sigma$. They may mutually refer to each other. We treat these as equirecursive (see 
Section 3) and we require them to be contractive, which means the right-hand 
side of a type definition cannot itself be a type name. Since we would like 
to directly observe the values of positive types, the definitions of type names 
$t^+ = \tau^+$ are inductive. This allows inductive reasoning about values returned 
by computations. On the other hand, negative type definitions $s^- = \sigma^-$ are 
recursive rather than coinductive in the usual sense, which would require, for 
example, stream computations to be productive. Because we do not wish to 
restrict recursive computations to those that are productive in this sense, they 
are “productive” only in the sense that they satisfy a standard progress theorem. 

Next, we come to the syntax for values $v$ of a positive type and computations 
$e$ of a negative type. Variables $x$ always stand for values and therefore have a 
positive type. We use $j$ to stand for labels, naming fields of variant records or 
lazy records, where $j \cdot v$ injects value $v$ into a sum with alternative labeled $j$ and 
$e.j$ projects field $j$ out of a lazy record. When we quantify over an (always finite) 
set of labels we usually write $\ell$ as a metavariable for the labels. 


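$$
\begin{array}{lcl}
v & ::= & x \mid \langle v_1, v_2 \rangle \mid \langle\rangle \mid j \cdot v \mid \mathsf{thunk}\ e \\
e & ::= & \lambda x.\, e \mid e\, v \mid \{\ell = e_\ell\}_{\ell \in L} \mid e.j \mid \mathsf{return}\ v \mid \mathsf{let\ return}\ x = e_1\ \mathsf{in}\ e_2 \mid f \\
  & \mid & \mathsf{match}\ v\ (\langle x, y \rangle \Rightarrow e) \mid \mathsf{match}\ v\ (\langle\rangle \Rightarrow e) \mid \mathsf{match}\ v\ (\ell \cdot x_\ell \Rightarrow e_\ell)_{\ell \in L} \mid \mathsf{force}\ v \\
\Sigma & ::= & \cdot \mid \Sigma, t^+ = \tau^+ \mid \Sigma, s^- = \sigma^- \mid \Sigma, f : \sigma^- = e
\end{array}
$$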


4 We borrow the notation $\oplus$ from linear logic even though no linearity is implied. 
In order to represent recursion, we use equations $f = e$ in the signature where $f$ is 
a defined expression name, which we distinguish from variables, and all equations 
can mutually reference each other. An alternative would have been explicit fixed 
point expressions fix f.e, but this mildly complicates both typing and mutual 
recursion. Also, it seems more elegant to represent all forms of recursion at the 
level of types and expressions in the same manner. We also choose to fix a type 
for each expression name in a signature. Otherwise, each occurrence of f in an 
expression could potentially be assigned a different type, which strays into the 
domain of parametric polymorphism and intersection types. 

Following Levy, we do not allow names for values because this would add an 
undesirable notion of computation to values, and, furthermore, circular values 
would violate the inductive interpretation of positive types. As discussed in [53, 
Chapter 4], they could be added back conservatively under some conditions. 


2.1 Dynamics 


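For the operational semantics, we use a judgment $e \mapsto e'$ defined inductively 
by the following rules, which may reference a global signature $\Sigma$ to look up the 
definitions of expression names $f$. In contrast, values do not reduce. The dynamics 
of call-by-push-value are defined as follows: 

$$
\begin{gathered}
(\lambda x.\, e)\, v \mapsto [v/x]e
\qquad
\dfrac{e \mapsto e'}{e\, v \mapsto e'\, v}
\qquad
\mathsf{let\ return}\ x = \mathsf{return}\ v\ \mathsf{in}\ e_2 \mapsto [v/x]e_2
\\[1.5ex]
\dfrac{e_1 \mapsto e_1'}{\mathsf{let\ return}\ x = e_1\ \mathsf{in}\ e_2 \mapsto \mathsf{let\ return}\ x = e_1'\ \mathsf{in}\ e_2}
\qquad
\dfrac{(j \in L)}{\{\ell = e_\ell\}_{\ell \in L}.j \mapsto e_j}
\qquad
\dfrac{e \mapsto e'}{e.j \mapsto e'.j}
\\[1.5ex]
\mathsf{match}\ \langle v_1, v_2 \rangle\ (\langle x, y \rangle \Rightarrow e) \mapsto [v_1/x][v_2/y]e
\qquad
\mathsf{match}\ \langle\rangle\ (\langle\rangle \Rightarrow e) \mapsto e
\\[1.5ex]
\dfrac{(j \in L)}{\mathsf{match}\ (j \cdot v)\ (\ell \cdot x_\ell \Rightarrow e_\ell)_{\ell \in L} \mapsto [v/x_j]e_j}
\qquad
\mathsf{force}\ (\mathsf{thunk}\ e) \mapsto e
\qquad
\dfrac{f = e \in \Sigma}{f \mapsto e}
\end{gathered}
$$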


Note that some computations, specifically $\lambda x.\, e$, $\{\ell = e_\ell\}_{\ell \in L}$, and $\mathsf{return}\ v$, 
do not reduce and may be considered values in other formulations. Here, we call 
them terminal computations and use the judgment $e$ terminal to identify them. 


$$
\lambda x.\, e\ \ \mathsf{terminal}
\qquad
\{\ell = e_\ell\}_{\ell \in L}\ \ \mathsf{terminal}
\qquad
\mathsf{return}\ v\ \ \mathsf{terminal}
$$


We will silently use simple properties of computations in the remainder of 
the paper which follow by straightforward induction. 


Lemma 1 (Computation). 


1. If $e \mapsto e'$ and $e \mapsto e''$ then $e' = e''$. 
2. It is not possible that both $e \mapsto e'$ and $e$ terminal. 
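
The stepping rules can be read as a small interpreter. The following Python sketch is only an illustration and not the authors' artifact [50]: the tagged-tuple encoding of terms and the names `subst`, `step`, `terminal`, and `run` are assumptions made here. Because `step` returns at most one successor and never steps a terminal computation, it mirrors both parts of Lemma 1.

```python
# Terms are tagged tuples (an encoding assumed for this sketch).
# Values:       ("var", x), ("unit",), ("pair", v1, v2), ("inj", j, v), ("thunk", e)
# Computations: ("lam", x, e), ("app", e, v), ("record", {l: e}), ("proj", e, j),
#               ("ret", v), ("let", x, e1, e2), ("name", f), ("force", v),
#               ("matchpair", v, x, y, e), ("matchunit", v, e), ("case", v, {l: (x, e)})

def subst(v, x, t):
    """Replace free occurrences of variable x in term t by the closed value v."""
    tag = t[0]
    if tag == "var":
        return v if t[1] == x else t
    if tag == "lam":
        return t if t[1] == x else ("lam", t[1], subst(v, x, t[2]))
    if tag == "let":
        _, y, e1, e2 = t
        return ("let", y, subst(v, x, e1), e2 if y == x else subst(v, x, e2))
    if tag == "matchpair":
        _, w, y, z, e = t
        return ("matchpair", subst(v, x, w), y, z, e if x in (y, z) else subst(v, x, e))
    if tag == "case":
        branches = {l: (y, e if y == x else subst(v, x, e)) for l, (y, e) in t[2].items()}
        return ("case", subst(v, x, t[1]), branches)
    if tag == "record":
        return ("record", {l: subst(v, x, e) for l, e in t[1].items()})
    # Remaining forms bind nothing: recurse into sub-terms, keep labels and names.
    return (tag,) + tuple(subst(v, x, a) if isinstance(a, tuple) else a for a in t[1:])

def terminal(e):
    """Lambdas, lazy records, and returns are the terminal computations."""
    return e[0] in ("lam", "record", "ret")

def step(sig, e):
    """Return the unique e' with e |-> e', or None if e is terminal or stuck."""
    tag = e[0]
    if tag == "name":                       # f |-> e  when  f = e is in the signature
        return sig[e[1]]
    if tag == "app":
        f, v = e[1], e[2]
        if f[0] == "lam":
            return subst(v, f[1], f[2])
        f2 = step(sig, f)
        return None if f2 is None else ("app", f2, v)
    if tag == "proj":
        r, j = e[1], e[2]
        if r[0] == "record":
            return r[1].get(j)
        r2 = step(sig, r)
        return None if r2 is None else ("proj", r2, j)
    if tag == "let":
        _, x, e1, e2 = e
        if e1[0] == "ret":
            return subst(e1[1], x, e2)
        e1p = step(sig, e1)
        return None if e1p is None else ("let", x, e1p, e2)
    if tag == "matchpair" and e[1][0] == "pair":
        _, (_, v1, v2), x, y, body = e
        return subst(v2, y, subst(v1, x, body))
    if tag == "matchunit" and e[1][0] == "unit":
        return e[2]
    if tag == "case" and e[1][0] == "inj" and e[1][1] in e[2]:
        x, body = e[2][e[1][1]]
        return subst(e[1][2], x, body)
    if tag == "force" and e[1][0] == "thunk":
        return e[1][1]
    return None

def run(sig, e, fuel):
    """Take at most `fuel` steps and return the computation reached."""
    for _ in range(fuel):
        nxt = step(sig, e)
        if nxt is None:
            return e
        e = nxt
    return e
```

The `fuel` bound in `run` is needed because recursion is unrestricted, so a computation may step forever.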
2.2 Some Sample Programs$^5$ 


Example 1 (Computing with Binary Numbers). We show some example programs 
for binary numbers in “little endian” representation (least significant bit first) 
and in standard form, that is, without leading zeros. 

bin⁺ = ⊕{e : 1, b0 : bin, b1 : bin} 

std⁺ = ⊕{e : 1, b0 : pos, b1 : std} 

pos⁺ = ⊕{       b0 : pos, b1 : std} 


We expect the subtyping relationships pos ≤ std ≤ bin to hold, because every 
positive standard number is a standard number, and every standard number is a 
binary number. According to our definition and rules in Sections 3 and 5 these 
will hold semantically as well as syntactically. 

We now show some simple definitions $f : \sigma^- = e$. 


six : ↑pos = return b0 · b1 · b1 · e · ⟨⟩ 


The increment function on binary numbers implements the carry with a recursive 
call, which has to be wrapped in a let/return. 


inc : std → ↑pos 
    = λx. match x ( e · u ⇒ return b1 · e · u 
                  | b0 · x′ ⇒ return b1 · x′ 
                  | b1 · x′ ⇒ let return y′ = inc x′ in return b0 · y′ ) 


By subtyping, we also have inc : std → ↑std, for example, but not inc : bin → ↑bin 
since bin ≰ std. However, the definition could be separately checked against this 
type, which points towards an eventual need for intersection types. 

The following incorrect version of the decrement function does not have the 
indicated desired type! 


dec0 : pos → ↑std   % incorrect! 
     = λx. match x ( b0 · x′ ⇒ let return y′ = dec0 x′ in return b1 · y′ 
                   | b1 · x′ ⇒ return b0 · x′ ) 


The error here is quite precisely located by the bidirectional type checker (see 
Section 6): When we inject b0 · x′ in the second branch it is not the case that 
x′ : pos as required for standard numbers! And, indeed, dec0 (b1 · e · ⟨⟩) ↦* 
return b0 · e · ⟨⟩, which is not in standard form. On the other hand, the fact that 
a branch for e · u is missing is correct because the type pos does not have an 
alternative for this label. 

We can fix this problem by discriminating one more level of the input (which 
could be made slightly more appealing by a compound syntax for nested pattern 
matching). 


dec : pos → ↑std 
    = λx. match x ( b0 · x′ ⇒ let return y′ = dec x′ in return b1 · y′ 
                  | b1 · x′ ⇒ match x′ ( e · u ⇒ return e · u 
                                       | b0 · x″ ⇒ return b0 · b0 · x″ 
                                       | b1 · x″ ⇒ return b0 · b1 · x″ )) 
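
The behaviour of these definitions can be replayed outside the calculus. The sketch below is an illustration only: it models a binary number as a Python list of bits, least significant first, with `[]` standing for e, and the predicates `is_std` and `is_pos` are hand-written mirrors of the type definitions std and pos. None of these names come from the paper's artifact.

```python
def is_std(n):
    """n inhabits std = +{e : 1, b0 : pos, b1 : std}."""
    if not n:
        return True
    return is_pos(n[1:]) if n[0] == 0 else is_std(n[1:])

def is_pos(n):
    """n inhabits pos = +{b0 : pos, b1 : std}; there is no alternative for e."""
    if not n:
        return False
    return is_pos(n[1:]) if n[0] == 0 else is_std(n[1:])

def inc(n):
    """Increment; the carry is the recursive call in the b1 branch."""
    if not n:
        return [1]
    if n[0] == 0:
        return [1] + n[1:]
    return [0] + inc(n[1:])

def dec0(n):
    """The incorrect decrement: the b1 branch may create a leading zero."""
    if n[0] == 0:
        return [1] + dec0(n[1:])
    return [0] + n[1:]

def dec(n):
    """The corrected decrement, which inspects one more level of the input."""
    if n[0] == 0:
        return [1] + dec(n[1:])
    rest = n[1:]
    if not rest:          # b1 . e      becomes  e
        return []
    return [0] + rest     # b1 . b? . x becomes  b0 . b? . x

six = [0, 1, 1]           # b0 . b1 . b1 . e
print(inc(six), dec(six), dec0([1]), is_std(dec0([1])))
```

On the input b1 · e, `dec0` produces b0 · e, which `is_std` rejects, while `dec` produces e. This is the same mismatch that the type checker reports at the injection b0 · x′.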


5 These examples and more are captured in our open access implementation artifact [50]. 
Example 2 (Computing with Streams). We present an example of a type with 
mixed polarities: a stream of standard numbers with a finite amount of padding 
between consecutive numbers. The programmer’s intent is for the stream to be lazy 
and infinite, i.e., no end-of-stream is provided. But because we do not restrict 
recursion, even a well-typed implementation may diverge and fail to produce 
another number. On the other hand, the padding must always be finite because 
the meaning of positive types is inductive. We present padded streams as two 
mutually dependent type definitions, one positive and one negative. Because our 
type definitions are equirecursive this isn’t strictly necessary, and we could just 
substitute out the definition of pstream⁻. 

For our example, we also define a subtype with zero padding, as forcing a 
single padding label none between any two elements could also be expressed. 


pstream⁻ = ↑(std ⊗ padding) 
padding⁺ = ⊕{none : padding, some : ↓pstream} 


zstream⁻ = ↑(std ⊗ ⊕{some : ↓zstream}) 


In zstream, we see the significance of variant record types with just one label: 
some. We exploit this in Section 7 to interpret isorecursive types into equirecursive 
ones. We have that zstream ≤ pstream, which means we can pass a stream with 
zero padding into any function expecting one with arbitrary padding. 

We now program two mutually recursive functions to create a stream with 
zero padding from a stream with arbitrary (but finite!) padding. 


compress : (↓pstream) → zstream 
omit : padding → zstream 


compress = λs. let return np = force s in 
               match np (⟨n, p⟩ ⇒ return ⟨n, some · thunk (omit p)⟩) 
omit = λp. match p ( none · p′ ⇒ omit p′ 
                   | some · s ⇒ compress s ) 


Example 3 (Omega). As a final example in this section we consider the em- 
bedding of the untyped λ-calculus. The untyped term under consideration is 
(λx. x x) (λx. x x). The first thing to notice is that this term is not even syntac- 
tically well-formed because x stands for a value, but in x x the function part 
needs to be an expression. Closely related is that the “usual” definition for the 
embedding of the untyped λ-calculus (see, for example, [42]) U = U → U isn’t 
properly polarized. So, we define it as U⁻ = (↓U) → U instead: 


ω : (↓U) → U                  Ω : U 
ω = λx. (force x) x           Ω = ω (thunk ω) 


Because our type definitions are equirecursive, both of these definitions are well- 
typed. Moreover, we also have ω : U and in fact the embedding of every untyped 
λ-term will have type U. We also observe that ω (thunk ω) ↦³ ω (thunk ω) and 
therefore represents a well-typed diverging term. Of course, f : U = f is also 
well-typed and reduces to itself in one step. 

Remarkably, with our notion of semantic typing we will see that Ω will have 
every type $\sigma^-$ and not just U [49, Appendix B, Example 9]! 
3 Semantic Typing 


Our aim is to justify both typing and subtyping by semantic means. We therefore 
start with semantic typing of closed values and computations, written $v \in \tau^+$ and 
$e \in \sigma^-$. From this we can, for example, define semantic subtyping for positive 
types $\tau^+ \subseteq \sigma^+$ as $\forall v.\ v \in \tau^+ \supset v \in \sigma^+$. 

Conceptually, semantic typing is a mixed inductive/coinductive definition. 
Values are typed inductively, which yields the correct interpretation of purely 
positive types such as natural numbers, lists, or trees, describing finite data 
structures. Computations are typed coinductively because they include the 
possibility of infinite computation by unbounded recursion. While we assume we 
can observe the structure of values, computations e cannot be observed directly. 
Different notions of observation for computation would yield different definitions 
of semantic typing. For our purposes, since we want to allow unfettered recursion, 
we posit we can (a) observe the fact that a computation steps according to our 
dynamics, even if we cannot examine the computation itself, and (b) when a 
computation is terminal we can observe its behavior by applying elimination 
forms (for types $\tau^+ \to \sigma^-$ and $\&\{\ell : \sigma_\ell^-\}_{\ell \in L}$) or by observing its returned value 
(for the type ${\uparrow}\tau^+$). 

Besides capturing a certain notion of observability, our semantics incorporates 
the usual concept of type soundness which is important both for implementations 
and for interpreting the results of computations. These are: 


Semantic Preservation (Theorem 1) If $e \in \sigma^-$ and $e \mapsto e'$ then $e' \in \sigma^-$. 

Semantic Progress (Theorem 2) If $e \in \sigma^-$ then either $e \mapsto e'$ for some $e'$ or $e$ 
is terminal (but not both). This captures the usual slogan that “well-typed 
programs do not go wrong” [57]. An implementation will not accidentally 
treat a pair as a function or try to decompose a function as if it were a pair. 

Semantic Observation If $v \in \tau^+$ then the structure of the value $v$ is deter- 
mined (inductively) by the type $\tau^+$. Similarly, a terminal computation $e \in {\uparrow}\tau^+$ 
must have the form $e = \mathsf{return}\ v$ with $v \in \tau^+$. 


These combine to the following: if we start a computation for $e \in {\uparrow}\tau^+$ then either 
$e \mapsto^* \mathsf{return}\ v$ for an observable value $v \in \tau^+$ after a finite number of steps, or $e$ 
does not terminate. 

These are close to their usual syntactic analogues, but the fact that we do 
not rely on any form of syntactic typing is methodologically significant. For 
example, if we have a program that does not obey a syntactic typing discipline 
but behaves correctly according to our semantic typing, our results will apply 
and this program, in combination with others that are well typed, will both be 
safe (semantic progress) and return meaningfully observable results (semantic 
preservation and observation). This point has been made passionately by Dreyer 
et al. [28] and applied, for example, to trusted libraries in Rust [47]. Another 
example can be found in gradual typing [38, 60]. As long as we can prove by 
any means that the “dynamically typed” portion of the program is semantically 
well-typed (even if not syntactically so), the combination is sound and can be 
executed without worry, returning a correctly observable result. A third example 
is provided by session types for message-passing concurrency [44]. While it is 
important to have a syntactic type discipline, processes in a distributed system 
may be programmed in a variety of languages some of which will have much 
weaker guarantees. Being able to prove their semantic soundness then guarantees 
the behavioral soundness of the composed system. 

Semantic typing in the context of call-by-push-value is well-suited for encoding 
computational effects, such as input/output, memory mutation, nontermination, 
etc. Call-by-push-value was designed as a study for the λ-calculus with effects [53, 
Sec. 2.4], stratifying terms into values (which have no side-effects) and com- 
putations (which might). Through the lens of semantic typing, we can ensure 
behavioral soundness in the presence of effects. 


3.1 Semantic Typing with Observation Depth 


Despite the extensive work on mixed inductive and coinductive definitions [3, 11, 
20, 21, 22, 43, 48, 51, 59, 61, 69], there is no widely accepted style in presenting 
such definitions and reasoning with them concisely in a mathematical language 
of discourse. With some regret, we therefore present our semantic definition 
by turning the coinductive part into an inductive one, following the basic idea 
underlying step indexing [7, 8, 10, 27]. Since the coinduction has priority over 
the induction, arguments proceed by nested induction, first over the step index 
and second over the structure of the inductive definition. This representation 
of mixed definitions implies that reasoning over step indices has lexicographic 
priority over values. 

An alternative point of view is provided by sized types [5, 6]. Both sized types 
and step indexing employ the same concept of observation depth; however, for 
sized types, we would observe data constructors, whereas for step indexing we 
observe computation steps. General recursion is supported in our system because 
“productivity” in the negative layer means that computations can step rather 
than produce a data constructor. The step index is actually the (universally 
quantified) observation depth for a coinductively defined predicate. We do not 
index the (existentially quantified) size of the inductive predicate but use its 
structure directly since values are finite and become smaller. All step indices k, i 
and occasionally j range over natural numbers. We use three judgments, 


1. $e \in_k \sigma^-$ ($e$ has semantic type $\sigma^-$ at index $k$) 
2. $e \mathrel{\hat{\in}}_{k+1} \sigma^-$ (terminal $e$ has semantic type $\sigma^-$ at index $k + 1$) 
3. $v \in_k \tau^+$ ($v$ has semantic type $\tau^+$ at index $k$) 


They should be defined by nested induction, first on k and second on the structure 
of v/e, where part 2 can rely on part 1 for a computation that is not terminal. We 
write v < v’ when v is a strict subexpression of v’. The clauses of the definition 
can be found in Figure 1. 

A few notes on these definitions. When expanding type definitions $t = \tau^+$ and 
$s = \sigma^-$ we rely on the assumption that type definitions are contractive, so one of 
the immediately following cases will apply next. This means that unlike many 
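$$
\begin{array}{lcl}
v \in_k t & \triangleq & v \in_k \tau^+ \ \text{for}\ t = \tau^+ \in \Sigma \\
v \in_k \tau_1^+ \otimes \tau_2^+ & \triangleq & v = \langle v_1, v_2 \rangle,\ v_1 \in_k \tau_1^+,\ \text{and}\ v_2 \in_k \tau_2^+ \ \text{for some}\ v_1, v_2 \\
v \in_k 1 & \triangleq & v = \langle\rangle \\
v \in_k \oplus\{\ell : \tau_\ell^+\}_{\ell \in L} & \triangleq & v = j \cdot v_j \ \text{and}\ v_j \in_k \tau_j^+ \ \text{for some}\ j \in L \\
v \in_k {\downarrow}\sigma^- & \triangleq & v = \mathsf{thunk}\ e \ \text{and}\ e \in_k \sigma^- \ \text{for some}\ e \\[1.5ex]
e \in_0 \sigma^- & & \text{always} \\
e \in_{k+1} \sigma^- & \triangleq & (e \mapsto e' \ \text{and}\ e' \in_k \sigma^-) \ \text{or}\ (e\ \text{terminal and}\ e \mathrel{\hat{\in}}_{k+1} \sigma^-) \\[1.5ex]
e \mathrel{\hat{\in}}_{k+1} s & \triangleq & e \mathrel{\hat{\in}}_{k+1} \sigma^- \ \text{for}\ s = \sigma^- \in \Sigma \\
e \mathrel{\hat{\in}}_{k+1} \tau^+ \to \sigma^- & \triangleq & e\, v \in_{i+1} \sigma^- \ \text{for all}\ i \le k \ \text{and}\ v \ \text{with}\ v \in_i \tau^+ \\
e \mathrel{\hat{\in}}_{k+1} \&\{\ell : \sigma_\ell^-\}_{\ell \in L} & \triangleq & e.j \in_{k+1} \sigma_j^- \ \text{for all}\ j \in L \\
e \mathrel{\hat{\in}}_{k+1} {\uparrow}\tau^+ & \triangleq & e = \mathsf{return}\ v \ \text{for some}\ v \in_k \tau^+ \\[1.5ex]
v \in \tau^+ & \triangleq & v \in_k \tau^+ \ \text{for all}\ k \\
e \in \sigma^- & \triangleq & e \in_k \sigma^- \ \text{for all}\ k
\end{array}
$$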


Fig. 1. Definition of Semantic Typing 


definitions in this style the types do not necessarily get smaller. For the inductive 
part (typing of values), the values do get smaller and for the coinductive part 
(typing of computations) the step index will get smaller because in the case of 
functions and records the constructed expression is not terminal. 
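
The remark that values shrink even though types do not can be illustrated on the fragment of positive types built from 1, ⊗, and ⊕ only. In that fragment no computation is involved, so the step index plays no role and membership is decidable by recursion on the value. The following Python sketch rests on assumptions made here, namely the dictionary-based signature encoding and the names `member` and `num`; it is not the paper's definition for the full language.

```python
# A signature maps each positive type name to one structural layer over names:
#   ("one",), ("tensor", t1, t2), ("plus", {label: t})
SIG = {
    "u":   ("one",),
    "bin": ("plus", {"e": "u", "b0": "bin", "b1": "bin"}),
    "std": ("plus", {"e": "u", "b0": "pos", "b1": "std"}),
    "pos": ("plus", {"b0": "pos", "b1": "std"}),
    "t0":  ("tensor", "u", "t0"),
}

# Values are tagged tuples: ("unit",), ("pair", v1, v2), ("inj", label, v).

def member(sig, v, t):
    """Decide whether closed value v inhabits the positive type name t.

    Every recursive call is on a strict sub-value of v, so the recursion
    terminates although the type names may be unfolded arbitrarily often.
    """
    d = sig[t]
    if d[0] == "one":
        return v == ("unit",)
    if d[0] == "tensor":
        return v[0] == "pair" and member(sig, v[1], d[1]) and member(sig, v[2], d[2])
    if d[0] == "plus":
        return v[0] == "inj" and v[1] in d[1] and member(sig, v[2], d[1][v[1]])
    raise ValueError("this sketch covers only 1, tensor, and plus")

def num(bits):
    """Build the value for a little-endian bit list, e.g. [0, 1, 1] is six."""
    v = ("inj", "e", ("unit",))
    for b in reversed(bits):
        v = ("inj", "b%d" % b, v)
    return v

print(member(SIG, num([0, 1, 1]), "pos"))   # six is a positive standard number
print(member(SIG, num([0]), "std"))         # b0 . e is not in standard form
```

Thunks are left out on purpose: deciding membership in ↓σ⁻ would require observing computations, which is where the step index enters.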

A number of variations on this definition are possible. A particularly interesting 
one avoids decreasing the step index unless recursion is unrolled [8, 27, 60] so 
sources of nontermination can be characterized more precisely. It may also be 
possible to keep the step index constant when analyzing a terminal computation 
of type ${\uparrow}\tau^+$. Stripping the return constructor constitutes a form of observation 
and therefore decreasing the index seems both appropriate and simplest. 

The quantification over $i \le k$ in the case of terminal computations of function 
type seems necessary because we need the relation to be downward closed so that 
it defines a deflationary fixed point [4, 41]. Values and computations are then 
semantically well-typed if they are well-typed for all step indices. 


Lemma 2 (Downward Closure). 


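1. $e \in_k \sigma^-$ implies $e \in_i \sigma^-$ for all $i \le k$ 
2. $e \mathrel{\hat{\in}}_{k+1} \sigma^-$ implies $e \mathrel{\hat{\in}}_{i+1} \sigma^-$ for all $i \le k$ 
3. $v \in_k \tau^+$ implies $v \in_i \tau^+$ for all $i \le k$ 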


Proof. By a routine nested induction on k and the structure of v/e where part 2 
can appeal to part 1 when e is not terminal. 


Here are some semantic types that can easily be verified (see [49, Appendix 
B]). 
Example 4 (Semantic Typing). 


1. $\lambda x.\ \mathsf{return}\ x \in \tau^+ \to {\uparrow}\tau^+$ for all $\tau^+$. 
2. Define $s_0 = 1 \to s_0$ and $e_0 = \lambda x.\, e_0$. Then $e_0 \in s_0$. 
3. Define $\omega = \lambda x.\ (\mathsf{force}\ x)\ x$ and $\Omega = \omega\ (\mathsf{thunk}\ \omega)$. Then $\Omega \in \sigma^-$ for every $\sigma^-$. 
4. Define $t_0 = 1 \otimes t_0$. Then there is no $v$ such that $v \in t_0$. 
5. Assume $e \in \rho^-$ for some $\rho^-$. Then $e \in t_0 \to \sigma^-$ for every $\sigma^-$. 


3.2 Properties of Semantic Typing 


The properties of semantic preservation and progress follow immediately just by 
applying the definitions and Lemma 1, so we elide their proofs. 


Theorem 1 (Semantic Preservation). If $e \in \sigma^-$ and $e \mapsto e'$ then $e' \in \sigma^-$. 


Theorem 2 (Semantic Progress). If $e \in \sigma^-$ then either $e \mapsto e'$ or $e$ is 
terminal, but not both. 


4 Subtyping 


The semantics of subtyping is quite easy to express using semantic typing. 
Definition 1 (Semantic Subtyping). 


1. $\tau^+ \subseteq \sigma^+$ iff $v \in \tau^+$ implies $v \in \sigma^+$ for all $v$. 
2. $\tau^- \subseteq \sigma^-$ iff $e \in \tau^-$ implies $e \in \sigma^-$ for all $e$. 


We would now like to give a syntactic definition of subtyping that expresses 
an algorithm and show it both sound and complete with respect to the given 
semantic definition. The intuitive rules for subtyping shouldn’t be surprising, 
although to our knowledge our formulation is original. 


4.1 Empty and Full Types 


A first observation is that $\tau^+ \subseteq \sigma^+$ whenever $\tau^+$ is an empty type, regardless of 
$\sigma^+$, because the necessary implication holds vacuously. So we need an algorithm 
to determine emptiness of a positive type. For the most streamlined presentation 
(which is also most suitable for an implementation) we first put the signature 
into a normal form that alternates between structural types and type names. 


$$
\begin{array}{lcl}
\tau^+ & ::= & t_1 \otimes t_2 \mid 1 \mid \oplus\{\ell : t_\ell\}_{\ell \in L} \mid {\downarrow}s \\
\sigma^- & ::= & t \to s \mid \&\{\ell : s_\ell\}_{\ell \in L} \mid {\uparrow}t \\
\Sigma & ::= & \cdot \mid \Sigma, t = \tau^+ \mid \Sigma, s = \sigma^- \mid \Sigma, f : s = e
\end{array}
$$

A usual presentation of emptiness maintains a collection of recursive types in 
a context in order to do a kind of loop detection. For example, the type $t = 1 \otimes t$ 
is empty because we may assume that $t$ is empty while testing $1 \otimes t$. Instead, we 
express this and similar kinds of arguments using valid circular reasoning. If one 
were to formalize it, it would be in $\mathrm{CLKID}^{\omega}$ [14], although the succedent of any 
sequent is either empty or a singleton (as in $\mathrm{CLJID}^{\omega}$ [12]). 

We construct circular derivations for t empty where t is a positive type name. 
Note that negative types are never empty. We can form a valid cycle when we 
encounter a goal $t$ empty as a proper subgoal of $t$ empty. Since we fix a signature 
$\Sigma$ once and for all before defining each judgment such as emptiness or subtyping, 
we omit the index $\Sigma$ since it never changes. The rules can be found in Figure 2. 


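$$
\begin{gathered}
\dfrac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \in \Sigma \qquad t_j\ \mathsf{empty}\ \ (\forall j \in L)}{t\ \mathsf{empty}}\ {\oplus}\mathrm{EMP}
\qquad
(\text{no rules for}\ t = 1\ \text{or}\ t = {\downarrow}s)
\\[1.5ex]
\dfrac{t = t_1 \otimes t_2 \in \Sigma \qquad t_1\ \mathsf{empty}}{t\ \mathsf{empty}}\ {\otimes}\mathrm{EMP}_1
\qquad
\dfrac{t = t_1 \otimes t_2 \in \Sigma \qquad t_2\ \mathsf{empty}}{t\ \mathsf{empty}}\ {\otimes}\mathrm{EMP}_2
\end{gathered}
$$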


Fig. 2. Circular Derivation Rules for Emptiness 


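Example 5. We continue Example 4, part (4), building a formal circular derivation. 
We first bring the signature into normal form, $\Sigma = \{u_0 = 1,\ t_0 = u_0 \otimes t_0\}$, and 
then construct 

$$
\dfrac{t_0 = u_0 \otimes t_0 \qquad \dfrac{}{t_0\ \mathsf{empty}}\ \mathrm{CYCLE}(\ast)}{t_0\ \mathsf{empty}\ \ (\ast)}\ {\otimes}\mathrm{EMP}_2
$$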


Theorem 3 (Emptiness). If $t$ empty then for all $k$ and $v$, $v \notin_k t$. 


Proof. We interpret the judgment $t$ empty semantically as $v \in_k t \vdash \cdot$ (which 
expresses $v \notin_k t$ in a sequent), where $t$ is given and $k$ and $v$ are parameters and 
therefore implicitly universally quantified. The proof of this judgment is carried 
out in a circular metalogic. We translate each inference rule for $t$ empty into a 
derivation for $v \in_k t \vdash \cdot$, where each unproven subgoal corresponds to a premise 
of the rule. When the derivation of $t$ empty is closed by a cycle, the corresponding 
derivation of $v \in_k t \vdash \cdot$ is closed by a corresponding cycle in the metalogic. The 
cases can be found in [49, Appendix D]. 
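
Read bottom-up, the emptiness rules suggest a simple decision procedure in which a goal that reappears on the current path is closed by a cycle. The Python sketch below is one possible reading, not the implementation from the artifact [50]; the signature encoding, the function name `is_empty`, and the example names `nat` and `void` are assumptions made here.

```python
# Normal-form signature: each positive name maps to one structural layer.
#   ("one",), ("tensor", t1, t2), ("plus", {label: t}), ("down", s)
def is_empty(sig, t, path=frozenset()):
    """Search for a circular derivation of `t empty`.

    `path` holds the goals between the root and the current goal. Meeting a
    goal again closes the branch by CYCLE; this is justified because every
    emptiness rule descends into a strict sub-value.
    """
    if t in path:
        return True                               # CYCLE
    path = path | {t}
    d = sig[t]
    if d[0] == "tensor":                          # tensor-EMP1 or tensor-EMP2
        return is_empty(sig, d[1], path) or is_empty(sig, d[2], path)
    if d[0] == "plus":                            # plus-EMP: every alternative is empty
        return all(is_empty(sig, u, path) for u in d[1].values())
    return False                                  # no rules for 1 and for down s

SIG = {
    "u0":   ("one",),
    "t0":   ("tensor", "u0", "t0"),               # t0 = 1 (x) t0 from Example 5
    "nat":  ("plus", {"z": "u0", "s": "nat"}),    # inhabited through label z
    "void": ("plus", {}),                         # variant record without labels
}
print(is_empty(SIG, "t0"), is_empty(SIG, "nat"), is_empty(SIG, "void"))
```

The search terminates because `path` grows with every recursive call and the signature has finitely many names. The sketch does not memoize results, so it favours clarity over efficiency.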


Next we symmetrically define what it means for a computation type $\sigma^-$ to be 
full, namely that it is inhabited by every (semantically well-typed) computation. 
A simple example is the type $\&\{\,\}$, that is, the lazy record without any fields. It 
contains every well-typed expression because all projections (of which there are 
none) are well-typed. It turns out that fullness is directly defined from emptiness. 
We may construct a derivation using the following rules. It could be circular, 
since the judgment $t$ empty allows circular derivations. 

$$
\dfrac{s = t_1 \to s_2 \in \Sigma \qquad t_1\ \mathsf{empty}}{s\ \mathsf{full}}\ {\to}\mathrm{FULL}
\qquad
\dfrac{s = \&\{\,\} \in \Sigma}{s\ \mathsf{full}}\ {\&}\mathrm{FULL}
\qquad
(\text{no rule for}\ s = {\uparrow}t)
$$


We interpret $s$ full as the entailment $e \in_k r \vdash e \in_k s$. In other words, we are 
assuming that $e$ is semantically well-typed at some $r$ and use that to show that 
it then will also be well-typed at the unrelated $s$. 


Theorem 4 (Fullness). If $s$ full then $e \in_k r$ implies $e \in_k s$ for all $k$, $e$, and $r$. 
Proof. (see [49, Appendix E]) 


Note that there is no rule that would allow us to conclude that $s = t_1 \to s_2$ is 
full if $s_2$ is full. Such a rule would be unsound: consider $\{\,\} \in \&\{\,\}$. It is not the 
case that $\{\,\} \in 1 \to \&\{\,\}$, so $1 \to \&\{\,\}$ is not full, even though $\&\{\,\}$ is. Similarly, 
$\lambda x.\{\,\} \in 1 \to \&\{\,\}$ but $\lambda x.\{\,\} \notin \&\{l : \&\{\,\}\}$, so $\&\{l : \&\{\,\}\}$ is not full. 


4.2 Syntactic Subtyping 


The rules for syntactic subtyping build a circular derivation of $t^+ \le u^+$ and 
$s^- \le r^-$. A circularity arises when a goal $t \le u$ or $s \le r$ arises as a subgoal 
strictly above a goal that is of one of these two forms. In general, we use $t$ 
and $u$ to stand for positive type names and $s$ and $r$ for negative type names 
without annotating those names. The polarity will also be clear from the context. 
Moreover, in the interest of saving space, we write $t = \tau^+$ and $s = \sigma^-$ when 
these definitions are in the fixed global signature $\Sigma$. The rules can be found in 
Figure 3. In particular, we would like to highlight the $\bot\mathrm{SUB}^+$, $\bot\mathrm{SUB}^-$, and $\top\mathrm{SUB}$ 
rules, which incorporate emptiness and fullness into syntactic subtyping. For 
example, among other subtypings, the $\bot\mathrm{SUB}^+$ rule establishes $t \le u$ whenever 
$t = t_1 \otimes t_2$ and either $t_1$ empty or $t_2$ empty. 


Example 6. We revisit Example 1 to show that pos ≤ std. We have annotated 
each subgoal from the ⊕SUB rule with the corresponding label; we have elided 
the reference to the ⊕SUB rule in the derivation for lack of space. Again, we 
normalize the signature before running the algorithm. 

u⁺ = 1 

std⁺ = ⊕{e : u, b0 : pos, b1 : std} 

pos⁺ = ⊕{       b0 : pos, b1 : std} 


The derivation is shown goal first, with the premises of each goal indented below it: 

pos ≤ std 
    [b0] pos ≤ pos (*) 
        [b0] pos ≤ pos             by CYCLE(*) 
        [b1] std ≤ std (†) 
            [e]  u ≤ u             by 1SUB 
            [b0] pos ≤ pos         by CYCLE(*) 
            [b1] std ≤ std         by CYCLE(†) 
    [b1] std ≤ std                 (derived as in (†)) 
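$$
\begin{gathered}
\dfrac{t = t_1 \otimes t_2 \quad u = u_1 \otimes u_2 \quad t_1 \le u_1 \quad t_2 \le u_2}{t \le u}\ {\otimes}\mathrm{SUB}
\qquad
\dfrac{t = 1 \quad u = 1}{t \le u}\ 1\mathrm{SUB}
\\[1.5ex]
\dfrac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \quad u = \oplus\{k : u_k\}_{k \in K} \quad \forall \ell \in L.\ t_\ell\ \mathsf{empty} \lor (\ell \in K \land t_\ell \le u_\ell)}{t \le u}\ {\oplus}\mathrm{SUB}
\\[1.5ex]
\dfrac{t = {\downarrow}s \quad u = {\downarrow}r \quad s \le r}{t \le u}\ {\downarrow}\mathrm{SUB}
\qquad
\dfrac{s = t_1 \to s_2 \quad r = u_1 \to r_2 \quad u_1 \le t_1 \quad s_2 \le r_2}{s \le r}\ {\to}\mathrm{SUB}
\\[1.5ex]
\dfrac{s = {\uparrow}t \quad r = {\uparrow}u \quad t \le u}{s \le r}\ {\uparrow}\mathrm{SUB}
\qquad
\dfrac{s = \&\{\ell : s_\ell\}_{\ell \in L} \quad r = \&\{j : r_j\}_{j \in K} \quad \forall j \in K.\ j \in L \land s_j \le r_j}{s \le r}\ {\&}\mathrm{SUB}
\\[1.5ex]
\dfrac{t\ \mathsf{empty} \quad u = \tau^+}{t \le u}\ \bot\mathrm{SUB}^+
\qquad
\dfrac{s = {\uparrow}t \quad t\ \mathsf{empty} \quad r = \sigma^-}{s \le r}\ \bot\mathrm{SUB}^-
\qquad
\dfrac{s = \sigma^- \quad r\ \mathsf{full}}{s \le r}\ \top\mathrm{SUB}
\end{gathered}
$$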

Fig. 3. Circular Derivation Rules for Subtyping 
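
Searching for a derivation with the rules of Figure 3 can be sketched as a recursive function that records the goals on the current path and closes a repeated goal by a cycle. The Python code below is a minimal sketch under assumptions made here (the signature encoding and the names `is_empty`, `is_full`, and `subtype`); it is not the algorithm of the artifact [50], and it expects both arguments of `subtype` to have the same polarity.

```python
# Normal-form signature, one structural layer per name:
#   positive: ("one",), ("tensor", t1, t2), ("plus", {l: t}), ("down", s)
#   negative: ("arrow", t, s), ("with", {l: s}), ("up", t)
POSITIVE = ("one", "tensor", "plus", "down")

def is_empty(sig, t, path=frozenset()):
    """Circular emptiness check for a positive type name."""
    if t in path:
        return True                                    # CYCLE
    path, d = path | {t}, sig[t]
    if d[0] == "tensor":
        return is_empty(sig, d[1], path) or is_empty(sig, d[2], path)
    if d[0] == "plus":
        return all(is_empty(sig, u, path) for u in d[1].values())
    return False

def is_full(sig, s):
    """A negative name is full if it is &{} or a function from an empty type."""
    d = sig[s]
    if d[0] == "arrow":
        return is_empty(sig, d[1])
    return d[0] == "with" and not d[1]

def subtype(sig, a, b, path=frozenset()):
    """Search for a circular derivation of a <= b."""
    if (a, b) in path:
        return True                                    # CYCLE
    path = path | {(a, b)}
    da, db = sig[a], sig[b]
    if da[0] in POSITIVE:
        if is_empty(sig, a):
            return True                                # bottom-SUB+
    else:
        if da[0] == "up" and is_empty(sig, da[1]):
            return True                                # bottom-SUB-
        if is_full(sig, b):
            return True                                # top-SUB
    if da[0] != db[0]:
        return False
    tag = da[0]
    if tag == "one":
        return True
    if tag == "tensor":
        return subtype(sig, da[1], db[1], path) and subtype(sig, da[2], db[2], path)
    if tag == "plus":                                  # width and depth
        return all(is_empty(sig, t) or (l in db[1] and subtype(sig, t, db[1][l], path))
                   for l, t in da[1].items())
    if tag in ("down", "up"):
        return subtype(sig, da[1], db[1], path)
    if tag == "arrow":                                 # contravariant in the argument
        return subtype(sig, db[1], da[1], path) and subtype(sig, da[2], db[2], path)
    # tag == "with": every field required by b must be present in a
    return all(j in da[1] and subtype(sig, da[1][j], r, path) for j, r in db[1].items())

SIG = {
    "u":   ("one",),
    "bin": ("plus", {"e": "u", "b0": "bin", "b1": "bin"}),
    "std": ("plus", {"e": "u", "b0": "pos", "b1": "std"}),
    "pos": ("plus", {"b0": "pos", "b1": "std"}),
}
print(subtype(SIG, "pos", "std"), subtype(SIG, "std", "bin"), subtype(SIG, "bin", "std"))
```

On the signature of Example 6 the search for pos ≤ std closes its inner goals pos ≤ pos and std ≤ std by cycles, much as in the derivation shown there.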


From a circular derivation we now construct a valid circular proof in an 
intuitionistic metalogic [12]. For example, $t \le u$ is interpreted as $t \subseteq u$, that is, 
every value in $t$ is also a value in $u$. We actually prove a slightly stronger theorem, 
namely that the step index on both sides can remain the same. 


Theorem 5 (Soundness of Subtyping). 


1. If $t \le u$ then $v \in_k t \vdash v \in_k u$ for all $k$ and $v$ (and so, $t \subseteq u$). 
2. If $s \le r$ then $e \in_k s \vdash e \in_k r$ for all $k$ and $e$ (and so, $s \subseteq r$). 


Proof. We proceed by a compositional translation of the circular derivation of 
subtyping into a circular derivation in the metalogic. For each rule we construct 
a derived rule on the semantic side with corresponding premises and conclusion. 

When the subtyping proof is closed due to a cycle, we close the proof in the 
metalogic with a corresponding cycle. In order for this cycle to be valid, it is 
critical that the judgments in the premises of the derived rule are strictly smaller 
than the judgments in the conclusion. Since our mixed logical relation is defined 
by nested induction, first on the step index k and second on the structure of 
the value v or expression e, the lexicographic measure (k, v/e) should strictly 
decrease. Some sample cases can be found in [49, Appendix F]. 


Besides soundness, reflexivity and transitivity of syntactic subtyping are two 
other properties that we prove for assurance that the syntactic subtyping rules are 
sensible and have no obvious gaps. These proofs can be found in [49, Appendix 
G]. Ligatti et al. [55] also consider a notion of preciseness as a syntactic means for 
judging the correctness of their syntactic subtyping rules. As they mention in [55, 
Sec. 6.2], this property is highly language-sensitive, depending on the choice 
of evaluation strategy (strict vs. nonstrict), where nonstrict subtyping relies 
on “which primitives are present in the language, sometimes in nonorthogonal 
ways.” Moreover, preciseness requires syntactically well-typed counterexamples, 
whereas we also consider ill-typed terms. We can straightforwardly prove that 
syntactic subtyping for purely positive types (in relation to strict evaluation) 
is complete with respect to semantic subtyping. We leave the preciseness of 
syntactic subtyping of negative types for future consideration. 


5 Syntactic Typing and Soundness 


We now introduce a syntactic typing judgment, for the moment without regard to 
decidability. Such a judgment is often called declarative typing, in contrast with 
the algorithmic typing of Section 6 (Figure 4). We prove that all syntactically 
well-typed terms are also semantically well-typed. Conceptually, a declarative 
system is unnecessary because the bidirectional system is very closely related, 
and there are no problems in justifying the soundness of the bidirectional 
system directly with respect to our semantics. We nevertheless present one for 
two reasons. First, the bidirectional system carries a small amount of additional 
bureaucracy (the rules are divided between four judgments instead of two, and 
there are two additional rules). Second, the standard versions of call-by-name 
and call-by-value use a similar form of declarative typing and are therefore 
easier to relate to our system in Section 8. 

Because all declarations in a signature can be mutually recursive, each 
declaration $f : \sigma^- = e$ is checked assuming all other declarations are valid. The 
soundness proof below justifies this. The complete set of judgments and rules with 
their corresponding presuppositions can be found in [49, Appendix H, Figs. 7 and 
8]. For these rules, we need contexts $\Gamma$, defined as usual with the presupposition 
that all variables declared in a context are distinct. 

\[ \Gamma ::= \cdot \mid \Gamma, x : \tau^+ \]

The rules for the key judgments $\Gamma \vdash v : \tau^+$ and $\Gamma \vdash e : \sigma^-$ can be obtained from 
the bidirectional rules in Section 6 by replacing both $v \Leftarrow \tau^+$ and $v \Rightarrow \tau^+$ with 
$v : \tau^+$ and, similarly, $e \Leftarrow \sigma^-$ and $e \Rightarrow \sigma^-$ with $e : \sigma^-$. Moreover, one should 
drop the two annotation rules ANNO$^+$ and ANNO$^-$ because these are not in the 
source language for declarative typing. 

We would like to show that the syntactic typing rules are sound with respect to 
their semantic interpretation. For that, we first define simultaneous substitutions 
$\theta$ of closed values for variables and $\theta \in_k \Gamma$ for the semantic interpretation of 
contexts as sets of substitutions at step index $k$. 

\[
\begin{array}{lcl}
\theta & ::= & \cdot \mid \theta, v/x \\
(\cdot) \in_k (\cdot) & & \text{always} \\
(\theta, v/x) \in_k (\Gamma, x : \tau^+) & \triangleq & \theta \in_k \Gamma \text{ and } v \in_k \tau^+
\end{array}
\]

On the semantic side, we define 

1. $\Gamma \models v \in_k \tau^+$ iff for all $\theta \in_k \Gamma$ we have $v[\theta] \in_k \tau^+$ 
2. $\Gamma \models e \in_k \sigma^-$ iff for all $\theta \in_k \Gamma$ we have $e[\theta] \in_k \sigma^-$ 


We now can prove a number of lemmas, one for each syntactic typing rule. 
A representative selection of the lemmas, each written as an admissible rule for 
semantic typing, can be given by: 


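\begin{gather*}
\frac{e \in_k \tau^+ \to \sigma^- \quad v \in_k \tau^+}{e\,v \in_k \sigma^-}
\qquad
\frac{x : \tau^+ \models e \in_k \sigma^-}{\lambda x.\,e \in_k \tau^+ \to \sigma^-}
\qquad
\frac{v_1 \in_k \tau_1^+ \quad v_2 \in_k \tau_2^+}{(v_1, v_2) \in_k \tau_1^+ \otimes \tau_2^+}
\\[1ex]
\frac{v \in_k \tau_1^+ \otimes \tau_2^+ \quad x : \tau_1^+, y : \tau_2^+ \models e \in_k \sigma^-}{\mathsf{match}\ v\ ((x, y) \Rightarrow e) \in_k \sigma^-}
\qquad
\frac{v \in_k \tau^+}{\mathsf{return}\ v \in_k {\uparrow}\tau^+}
\qquad
\frac{v \in_k {\downarrow}\sigma^-}{\mathsf{force}\ v \in_k \sigma^-}
\\[1ex]
\frac{e_1 \in_k {\uparrow}\tau^+ \quad x : \tau^+ \models e_2 \in_k \sigma^-}{\mathsf{let\ return}\ x = e_1\ \mathsf{in}\ e_2 \in_k \sigma^-}
\qquad
\frac{e \in_k \sigma^-}{\mathsf{thunk}\ e \in_k {\downarrow}\sigma^-}
\qquad
\frac{v \in_k \tau^+ \quad \tau^+ \le \sigma^+}{v \in_k \sigma^+}
\end{gather*}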


The proofs are somewhat interesting: some require induction on k, others 
follow more directly by definition. Due to a lack of space, the proofs can be found 
in [49, Appendix I], each admissible rule formulated as a separate lemma. 


Theorem 6 (Soundness of Syntactic Typing). Assume $\theta \in_k \Gamma$. 


1. If $\Gamma \vdash v : \tau^+$ then $v[\theta] \in_k \tau^+$ 
2. If $\Gamma \vdash e : \sigma^-$ then $e[\theta] \in_k \sigma^-$ 


Proof. We construct a circular proof based on the typing derivation, and the 
typing derivations for all definitions $f : \sigma^- = e \in \Sigma$. There are three kinds of 
cases (see [49, Appendix I] for samples of each): 


1. The case of variables $x$ follows by assumption on $\theta$. 

2. In the case of names $f : \sigma^- = e \in \Sigma$ we either expand to $e$ or close the proof 
with a cycle if we have expanded $f$ already. 

3. All other rules follow by the lemmas presented above. 
In all these lemmas the step index remains constant for the premises, which 
is important so we can form a circular proof in the case of names. 


Because soundness is stated for all $\theta$, $\Gamma$, and $k$, we can immediately obtain 
corollaries such as that $\cdot \vdash v : \tau^+$ implies that $v \in \tau^+$, and that $\cdot \vdash e : \sigma^-$ implies 
that $e \in \sigma^-$. 


6 Bidirectional Typing 


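We now shift from our declarative typing system into an algorithmic one that 
describes a practical decision procedure. We choose to express it as a 
bidirectional typechecking algorithm, particularly to avoid inference issues regarding 
subsumption [45] and our extensive use of type names and variant records, as 
well as the approach’s deep integration with polarized logics [29, Section 8.3]. 
Moreover, bidirectional typing is quite robust with respect to language extensions 
where various inference procedures are not. 

\begin{gather*}
\frac{\Gamma \vdash v_1 \Leftarrow \tau_1^+ \quad \Gamma \vdash v_2 \Leftarrow \tau_2^+}{\Gamma \vdash (v_1, v_2) \Leftarrow \tau_1^+ \otimes \tau_2^+}\ {\otimes}\mathrm{I}
\qquad
\frac{\Gamma \vdash v \Rightarrow \tau_1^+ \otimes \tau_2^+ \quad \Gamma, x : \tau_1^+, y : \tau_2^+ \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \mathsf{match}\ v\ ((x, y) \Rightarrow e) \Leftarrow \sigma^-}\ {\otimes}\mathrm{E}
\\[1ex]
\frac{x : \tau^+ \in \Gamma}{\Gamma \vdash x \Rightarrow \tau^+}\ \mathrm{VAR}
\qquad
\frac{}{\Gamma \vdash () \Leftarrow 1}\ 1\mathrm{I}
\qquad
\frac{\Gamma \vdash v \Rightarrow 1 \quad \Gamma \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \mathsf{match}\ v\ (() \Rightarrow e) \Leftarrow \sigma^-}\ 1\mathrm{E}
\\[1ex]
\frac{\Gamma \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \mathsf{thunk}\ e \Leftarrow {\downarrow}\sigma^-}\ {\downarrow}\mathrm{I}
\qquad
\frac{\Gamma \vdash v \Rightarrow {\downarrow}\sigma^-}{\Gamma \vdash \mathsf{force}\ v \Rightarrow \sigma^-}\ {\downarrow}\mathrm{E}
\qquad
\frac{(j \in L) \quad \Gamma \vdash v \Leftarrow \tau_j^+}{\Gamma \vdash j \cdot v \Leftarrow \oplus\{\ell : \tau_\ell^+\}_{\ell \in L}}\ {\oplus}\mathrm{I}
\\[1ex]
\frac{\Gamma \vdash v \Rightarrow \oplus\{\ell : \tau_\ell^+\}_{\ell \in L} \quad \forall(\ell \in L) : \Gamma, x_\ell : \tau_\ell^+ \vdash e_\ell \Leftarrow \sigma^-}{\Gamma \vdash \mathsf{match}\ v\ (\ell \cdot x_\ell \Rightarrow e_\ell)_{\ell \in L} \Leftarrow \sigma^-}\ {\oplus}\mathrm{E}
\qquad
\frac{\Gamma, x : \tau^+ \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \lambda x.\,e \Leftarrow \tau^+ \to \sigma^-}\ {\to}\mathrm{I}
\\[1ex]
\frac{\Gamma \vdash e \Rightarrow \tau^+ \to \sigma^- \quad \Gamma \vdash v \Leftarrow \tau^+}{\Gamma \vdash e\,v \Rightarrow \sigma^-}\ {\to}\mathrm{E}
\qquad
\frac{\forall(\ell \in L) : \Gamma \vdash e_\ell \Leftarrow \sigma_\ell^-}{\Gamma \vdash \{\ell = e_\ell\}_{\ell \in L} \Leftarrow \&\{\ell : \sigma_\ell^-\}_{\ell \in L}}\ \&\mathrm{I}
\\[1ex]
\frac{\Gamma \vdash e \Rightarrow \&\{\ell : \sigma_\ell^-\}_{\ell \in L} \quad (j \in L)}{\Gamma \vdash e.j \Rightarrow \sigma_j^-}\ \&\mathrm{E}
\qquad
\frac{f : \sigma^- = e \in \Sigma}{\Gamma \vdash f \Rightarrow \sigma^-}\ \mathrm{NAME}
\qquad
\frac{\Gamma \vdash v \Leftarrow \tau^+}{\Gamma \vdash \mathsf{return}\ v \Leftarrow {\uparrow}\tau^+}\ {\uparrow}\mathrm{I}
\\[1ex]
\frac{\Gamma \vdash e_1 \Rightarrow {\uparrow}\tau^+ \quad \Gamma, x : \tau^+ \vdash e_2 \Leftarrow \sigma^-}{\Gamma \vdash \mathsf{let\ return}\ x = e_1\ \mathsf{in}\ e_2 \Leftarrow \sigma^-}\ {\uparrow}\mathrm{E}
\qquad
\frac{\Gamma \vdash v \Rightarrow \tau^+ \quad \tau^+ \le \sigma^+}{\Gamma \vdash v \Leftarrow \sigma^+}\ \mathrm{SUB}^+
\\[1ex]
\frac{\Gamma \vdash e \Rightarrow \tau^- \quad \tau^- \le \sigma^-}{\Gamma \vdash e \Leftarrow \sigma^-}\ \mathrm{SUB}^-
\qquad
\frac{\Gamma \vdash v \Leftarrow \tau^+}{\Gamma \vdash (v : \tau^+) \Rightarrow \tau^+}\ \mathrm{ANNO}^+
\qquad
\frac{\Gamma \vdash e \Leftarrow \sigma^-}{\Gamma \vdash (e : \sigma^-) \Rightarrow \sigma^-}\ \mathrm{ANNO}^-
\end{gather*}


Fig. 4. Bidirectional Typing 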

Bidirectional typechecking [68] has been a popular choice for presenting 
algorithmic typing, especially when concerned with subtyping [30], and is decidable 
for a wide range of rich type systems. This approach splits each of the typing 
judgments, $\Gamma \vdash v : \tau^+$ and $\Gamma \vdash e : \sigma^-$, into checking ($\Leftarrow$) and synthesis ($\Rightarrow$) 
judgments for values and expressions, respectively: $\Gamma \vdash v \Leftarrow \tau^+$, $\Gamma \vdash v \Rightarrow \tau^+$ 
and $\Gamma \vdash e \Leftarrow \sigma^-$, $\Gamma \vdash e \Rightarrow \sigma^-$. 

We follow the recipe laid out by [25, 32]: introduction rules check and elimina- 
tion rules synthesize. More precisely, the principal judgment, premise or conclusion, 
has the connective being introduced by checking or eliminated by synthesis. 

We introduce two new forms of syntactic values $(v : \tau^+)$ and computations 
$(e : \sigma^-)$ which exist purely for typechecking purposes and are erased before 
evaluation. These forms are not actually used in any of our examples because 
definitions in the signature already require annotations. 

Applying the recipe, we can easily convert our declarative rules into 
bidirectional ones, as laid out in Section 5. The only rules we add to the system are 
ANNO$^+$ and ANNO$^-$, which allow us to prove completeness. All the examples in 
Section 2.2 check with these rules and only require type annotations at the top 
level of the declarations in the signature. 
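
To illustrate the recipe (introductions check, eliminations synthesize), the following is a minimal Python sketch of a checker for a small fragment of the language. It is not the implementation described in this paper: the tuple encoding of terms and types and the names `check`, `synth`, and `CheckError` are assumptions made for illustration, values and expressions share one pair of functions instead of four judgments, variants, lazy records, and signature names are omitted, and subtyping in the subsumption step is approximated by plain equality.

```python
class CheckError(Exception):
    """Raised when a term fails to check or synthesize (hypothetical name)."""

def synth(ctx, term):
    """Synthesis: variables, eliminations, and annotations."""
    tag = term[0]
    if tag == "var":                                   # VAR
        if term[1] not in ctx:
            raise CheckError(f"unbound variable {term[1]}")
        return ctx[term[1]]
    if tag == "force":                                 # down-shift elimination
        ty = synth(ctx, term[1])
        if ty[0] != "down":
            raise CheckError("force expects a thunk type")
        return ty[1]
    if tag == "app":                                   # function elimination
        ty = synth(ctx, term[1])
        if ty[0] != "arrow":
            raise CheckError("application expects a function type")
        check(ctx, term[2], ty[1])
        return ty[2]
    if tag == "anno":                                  # ANNO+ / ANNO-
        check(ctx, term[1], term[2])
        return term[2]
    raise CheckError(f"cannot synthesize a type for {tag}")

def check(ctx, term, ty):
    """Checking: introductions, let, and a subsumption fallback."""
    tag = term[0]
    if tag == "unit" and ty[0] == "one":               # unit introduction
        return
    if tag == "pair" and ty[0] == "tensor":            # pair introduction
        check(ctx, term[1], ty[1])
        check(ctx, term[2], ty[2])
        return
    if tag == "thunk" and ty[0] == "down":             # down-shift introduction
        check(ctx, term[1], ty[1])
        return
    if tag == "lam" and ty[0] == "arrow":              # function introduction
        check({**ctx, term[1]: ty[1]}, term[2], ty[2])
        return
    if tag == "ret" and ty[0] == "up":                 # up-shift introduction
        check(ctx, term[1], ty[1])
        return
    if tag == "let":                                   # up-shift elimination
        bound = synth(ctx, term[2])
        if bound[0] != "up":
            raise CheckError("let expects a returner type")
        check({**ctx, term[1]: bound[1]}, term[3], ty)
        return
    # Subsumption: synthesize, then compare.
    # Assumption: equality stands in for the subtyping judgment.
    if synth(ctx, term) != ty:
        raise CheckError("type mismatch")

ONE = ("one",)
ID_TY = ("down", ("arrow", ONE, ("up", ONE)))          # thunked identity on unit
ID = ("thunk", ("lam", "x", ("ret", ("var", "x"))))
check({}, ID, ID_TY)
result = synth({}, ("app", ("force", ("anno", ID, ID_TY)), ("unit",)))
```

In the usage lines, the thunked identity checks against its type directly, but it has to be annotated before it can appear in the synthesizing position of `force`, which is the situation the two annotation rules are there for.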

Due to our use of equirecursive types, the implementation of this system 
can closely follow the structure of the rules in Figures 2, 3, and 4. First, as 
mentioned in Section 4.1, we convert the signature into a normal form that 
alternates structural types and type names. Then, we determine all the empty 
type names using a memoization table for $t^+$ empty to easily construct circular 
derivations of emptiness (bottom-up) using the rules in Figure 2. If constructing 
such a derivation fails then $t^+$ is nonempty. Fullness is derived from emptiness 
non-recursively. From there, we build a memoization table for $t^+ \le u^+$ and 
$s^- \le r^-$, for positive and negative type names, so we can construct circular 
derivations of subtyping between names (also bottom-up). This happens lazily, 
only computing $t^+ \le u^+$ or $s^- \le r^-$ if typechecking requires this information. 
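
The emptiness step can be sketched in Python as follows. This is not the memoized circular-derivation search described above but a swapped-in technique: it computes the inhabited positive type names as a least fixed point and takes the complement, which we would expect to yield the same set of empty names. The rules of Figure 2 are not reproduced in this section, so the sketch assumes that a pair is empty if one component is, that a variant is empty if all of its branches are, and that the unit type and shifted types are never empty. The dictionary encoding of signatures and the name `empty_names` are hypothetical.

```python
def empty_names(sig):
    """Return the positive type names of `sig` that have no values.

    `sig` maps each name to a structural type in normal form:
      ("one",), ("tensor", t1, t2), ("plus", {label: t}), ("down", s)
    where t, t1, t2 are positive names and s is a negative name.
    """
    inhabited = set()
    changed = True
    while changed:                      # iterate until no new name becomes inhabited
        changed = False
        for name, ty in sig.items():
            if name in inhabited:
                continue
            tag = ty[0]
            if tag in ("one", "down"):  # assumed: never empty
                ok = True
            elif tag == "tensor":       # a pair needs both components
                ok = ty[1] in inhabited and ty[2] in inhabited
            elif tag == "plus":         # a variant needs at least one inhabited branch
                ok = any(t in inhabited for t in ty[1].values())
            else:
                raise ValueError(f"unknown type constructor {tag}")
            if ok:
                inhabited.add(name)
                changed = True
    return set(sig) - inhabited

example = {
    "unit": ("one",),
    "nat": ("plus", {"zero": "unit", "succ": "nat"}),
    "loop": ("tensor", "unit", "loop"),   # no finite value: empty
    "void": ("plus", {}),                 # empty variant
    "susp": ("down", "s"),                # thunks are always inhabited
    "mix": ("plus", {"a": "loop", "b": "void"}),
}
empties = empty_names(example)
```

Computing the complement avoids caching a result that was derived under a cycle assumption that later turns out to fail, at the cost of visiting every name eagerly.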

Bidirectional typing, given subtyping, follows the rules in Figure 4, including 
the rules for positive and negative subsumption, but it requires that the types 
in annotations are also translated to normal form, possibly introducing new 
(user-invisible) definitions in the signature. 

The theorems (with straightforward proofs) for soundness and completeness 
of bidirectional typechecking can be found in [49, Appendix J, Thms. 12 and 13]. 


7 Interpretation of Isorecursive Types 


Our system uses equirecursive types, which allow many subtyping relations since 
there are no term constructors for folding recursive types. Moreover, equirecursive 
types support the normal form where constructors are always applied to type 
names (see Section 4.1), simplifying our algorithms, their description and im- 
plementations. Most importantly, perhaps, equirecursive types are more general 
because we can directly interpret isorecursive types, which are embodied by fold 
and unfold operators, into our equirecursive setting and apply our results. 

We give a short sketch here; details can be found in [49, Appendix K]. For 
every recursive type $\mu\alpha^+.\tau^+$ we introduce a definition $t^+ = \oplus\{\mathsf{fold}_\mu : [t^+/\alpha^+]\tau^+\}$. 
Similarly, for every corecursive type $\nu\alpha^-.\sigma^-$ we introduce a definition $s^- = 
\&\{\mathsf{fold}_\nu : [s^-/\alpha^-]\sigma^-\}$. Now, the labels $\mathsf{fold}_\mu$ and $\mathsf{fold}_\nu$ tagging the sole choice of a 
unary variant or lazy record, respectively, play exactly the role that the fold 
constructor plays for recursive types. This entirely straightforward translation is 
enabled by our generalization of the binary sum and lazy pairs to variant and 
lazy records, respectively, so we can use them in their unary form. 
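
The positive half of this translation can be sketched in a few lines of Python. The tuple encoding, the generated names `t0`, `t1`, ..., and the function name `translate` are assumptions made for illustration; the sketch handles only closed types built from the unit type, pairs, variants, and recursive types, and it does not produce the alternating normal form of Section 4.1.

```python
def translate(ty, sig, env=None):
    """Replace every recursive type in `ty` by a fresh name defined in `sig`.

    Types: ("var", a), ("one",), ("tensor", A, B), ("plus", {label: A}),
           ("mu", a, A). The result uses ("name", t) in place of mu-types.
    """
    env = {} if env is None else env
    tag = ty[0]
    if tag == "var":
        return ("name", env[ty[1]])          # substitution [t/alpha]
    if tag == "one":
        return ty
    if tag == "tensor":
        return ("tensor", translate(ty[1], sig, env), translate(ty[2], sig, env))
    if tag == "plus":
        return ("plus", {l: translate(a, sig, env) for l, a in ty[1].items()})
    if tag == "mu":
        name = f"t{len(sig)}"                # hypothetical fresh-name scheme
        sig[name] = None                     # reserve the name before visiting the body
        body = translate(ty[2], sig, {**env, ty[1]: name})
        sig[name] = ("plus", {"fold_mu": body})   # unary variant tagged by the fold label
        return ("name", name)
    raise ValueError(f"unknown type constructor {tag}")

nat_iso = ("mu", "a", ("plus", {"zero": ("one",), "succ": ("var", "a")}))
signature = {}
nat_equi = translate(nat_iso, signature)
```

After the call, `signature` holds a single definition whose only label is `fold_mu`, wrapping the body of the recursive type with the bound variable replaced by the new name.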


8 Call-by-Name and Call-by-Value 


More familiar than call-by-push-value (CBPV) are the lazy, call-by-name (CBN) 
and eager, call-by-value (CBV) operational semantics that underlie the Haskell 
and ML families of functional programming languages. Levy [54] has shown that 
both CBN and CBV exist as fragments of CBPV, exhibiting translations from 
CBN and CBV types and terms into the CBPV language. In this section, we 
derive systems of subtyping for CBN and CBV from these translations into ours 
and prove them sound and complete. We discover that they are minor variants 
of existing systems for CBN [39] and CBV [55] subtyping. 

Because polarized subtyping is able to connect Levy’s translations with 
existing systems for CBN and CBV subtyping, it serves as further evidence that 
those prior translations and our subtyping rules are, in some sense, canonical. 
Moreover, it is yet one more piece of evidence that CBPV is an effective synthesis 
of evaluation orders in which to study the theory of functional programming. 


8.1 Call-by-Name 


Consider a CBN language with the following types. The language of terms and 
the standard statics and dynamics can be found in [49, Appendix L]. 


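\[ \tau, \sigma ::= \tau \to \sigma \mid \tau_1 \otimes \tau_2 \mid 1 \mid \oplus\{\ell : \tau_\ell\}_{\ell \in L} \mid \&\{\ell : \tau_\ell\}_{\ell \in L} \]


In this section, we will focus on function types $\tau \to \sigma$ and variant record types 
$\oplus\{\ell : \tau_\ell\}_{\ell \in L}$ and their corresponding terms. 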

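Levy [54] presents translations, $(-)^N$, from CBN types and terms to CBPV 
negative types and expressions, respectively. An auxiliary translation, ${\downarrow}(-)^N$, on 
contexts is also used. Here, we elide the translation of terms other than variables 
and the terms for function and variant record types; the full translation on terms 
can be found in [54]. 


Types 
\[
\begin{array}{lcl}
(\tau \to \sigma)^N & = & {\downarrow}\tau^N \to \sigma^N \\
(\tau_1 \otimes \tau_2)^N & = & {\uparrow}({\downarrow}\tau_1^N \otimes {\downarrow}\tau_2^N) \\
1^N & = & {\uparrow}1 \\
(\oplus\{\ell : \tau_\ell\}_{\ell \in L})^N & = & {\uparrow}\oplus\{\ell : {\downarrow}\tau_\ell^N\}_{\ell \in L} \\
(\&\{\ell : \sigma_\ell\}_{\ell \in L})^N & = & \&\{\ell : \sigma_\ell^N\}_{\ell \in L}
\end{array}
\]

Terms 
\[
\begin{array}{lcl}
(x)^N & = & \mathsf{return}\ x \\
(\lambda x.\,e)^N & = & \lambda x.\,e^N \\
(e_1\,e_2)^N & = & e_1^N\ (\mathsf{thunk}\ e_2^N)
\end{array}
\]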


We also translate type names $t$ to fresh type names $t^N$, translating the body of 
$t$’s definition and inserting additional type names as required for the normal form 
that alternates between structural types and type names. Levy [54] proves that 
well-typed terms are well-typed after the translation to CBPV is applied. Our 
syntactic typing rules are the same, so the theorem carries over to our setting. 
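
The type part of the CBN translation is a simple structural recursion, sketched below in Python. The clauses follow our reading of the translation table above: a down shift guards function arguments, pair components, and variant branches, and an up shift wraps the positive connectives. The tuple encoding and the function name `cbn` are assumptions for illustration, and type names and the normal-form step are left out.

```python
def cbn(ty):
    """Translate a structural CBN type into a CBPV negative type."""
    tag = ty[0]
    if tag == "arrow":      # argument is thunked, result stays a computation
        return ("arrow", ("down", cbn(ty[1])), cbn(ty[2]))
    if tag == "tensor":     # pair of thunks, returned by a computation
        return ("up", ("tensor", ("down", cbn(ty[1])), ("down", cbn(ty[2]))))
    if tag == "one":
        return ("up", ("one",))
    if tag == "plus":       # every branch carries a thunk
        return ("up", ("plus", {l: ("down", cbn(a)) for l, a in ty[1].items()}))
    if tag == "with":       # lazy records translate componentwise
        return ("with", {l: cbn(a) for l, a in ty[1].items()})
    raise ValueError(f"unknown type constructor {tag}")

unit_to_unit = cbn(("arrow", ("one",), ("one",)))
empty_variant = cbn(("plus", {}))
```

The second usage line shows that the empty variant becomes an up-shifted type, that is, a negative type.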

We adapt the subtyping system of Gay and Hole [39] to a $\lambda$-calculus from the 
$\pi$-calculus, which reverses the direction of subtyping from their classical system 
and adds empty records, obtaining the CBN syntactic subtyping rules shown in 
Figure 5. 

These rules introduce a CBN syntactic subtyping judgment $t \le u$. To 
distinguish it from CBPV syntactic subtyping, we will take care in this section to 
always include superscript pluses and minuses for CBPV type names, with CBN 
type names being unmarked. As for CBPV syntactic subtyping, the rules for 
CBN subtyping shown in Figure 5 build a circular derivation. Just as before, a 
circularity arises when a goal $t \le u$ arises as a proper subgoal of itself. 


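\begin{gather*}
\frac{t = t_1 \to t_2 \quad u = u_1 \to u_2 \quad u_1 \le t_1 \quad t_2 \le u_2}{t \le u}\ {\to}\mathrm{SUB}_N
\\[1ex]
\frac{t = t_1 \otimes t_2 \quad u = u_1 \otimes u_2 \quad t_1 \le u_1 \quad t_2 \le u_2}{t \le u}\ {\otimes}\mathrm{SUB}_N
\qquad
\frac{t = 1 \quad u = 1}{t \le u}\ 1\mathrm{SUB}_N
\\[1ex]
\frac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \quad u = \oplus\{j : u_j\}_{j \in J} \quad (L \subseteq J) \quad \forall(\ell \in L) : t_\ell \le u_\ell}{t \le u}\ {\oplus}\mathrm{SUB}_N
\\[1ex]
\frac{t = \&\{\ell : t_\ell\}_{\ell \in L} \quad u = \&\{j : u_j\}_{j \in J} \quad (L \supseteq J) \quad \forall(j \in J) : t_j \le u_j}{t \le u}\ \&\mathrm{SUB}_N
\\[1ex]
\frac{t = \oplus\{\,\} \quad u = \sigma}{t \le u}\ {\bot}\mathrm{SUB}_N
\qquad
\frac{t = \tau \quad u\ \mathsf{full}}{t \le u}\ {\top}\mathrm{SUB}_N
\qquad
\frac{t = \&\{\,\}}{t\ \mathsf{full}}\ \&\mathrm{FULL}_N
\end{gather*}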

Fig. 5. Circular Derivation Rules for Call-by-Name Subtyping 


These rules are exact analogues of those of Gay and Hole [39], with one 
exception. The three rules involving empty variants and records, namely $\bot\mathrm{SUB}_N$, 
$\top\mathrm{SUB}_N$, and $\&\mathrm{FULL}_N$, have no analogues in [39] only because their language did 
not include the corresponding empty internal and external choice types. 
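
A decision procedure in the style of these circular derivations can be sketched in Python as follows. The sketch reads the rules of Figure 5 as: contravariant arguments for functions, componentwise pairs, width and depth subtyping for variants and lazy records, the empty variant below everything, and everything below a full type, where only the empty lazy record is full. A goal that reappears among its own ancestors closes the derivation successfully. The signature encoding and the names `subtype` and `is_full` are assumptions for illustration, and the set of ancestor goals is passed along each branch instead of being stored in a global memoization table.

```python
def is_full(sig, t):
    """Only the empty lazy record is full in the CBN system."""
    return sig[t] == ("with", {})

def subtype(sig, t, u, seen=frozenset()):
    """Decide t <= u for type names of a CBN signature in normal form."""
    if (t, u) in seen:                   # goal is a proper subgoal of itself: cycle
        return True
    seen = seen | {(t, u)}
    T, U = sig[t], sig[u]
    if T == ("plus", {}):                # empty variant is a subtype of anything
        return True
    if is_full(sig, u):                  # anything is a subtype of a full type
        return True
    if T[0] != U[0]:
        return False
    tag = T[0]
    if tag == "one":
        return True
    if tag == "arrow":                   # contravariant argument, covariant result
        return subtype(sig, U[1], T[1], seen) and subtype(sig, T[2], U[2], seen)
    if tag == "tensor":
        return subtype(sig, T[1], U[1], seen) and subtype(sig, T[2], U[2], seen)
    if tag == "plus":                    # fewer labels on the left
        return set(T[1]).issubset(U[1]) and all(
            subtype(sig, T[1][l], U[1][l], seen) for l in T[1])
    if tag == "with":                    # more labels on the left
        return set(U[1]).issubset(T[1]) and all(
            subtype(sig, T[1][j], U[1][j], seen) for j in U[1])
    raise ValueError(f"unknown type constructor {tag}")

sig = {
    "one": ("one",),
    "nat": ("plus", {"zero": "one", "succ": "nat"}),
    "even": ("plus", {"zero": "one", "succ": "odd"}),
    "odd": ("plus", {"succ": "even"}),
    "void": ("plus", {}),
    "top": ("with", {}),
    "f": ("arrow", "nat", "even"),
    "g": ("arrow", "even", "nat"),
}
even_below_nat = subtype(sig, "even", "nat")
nat_below_even = subtype(sig, "nat", "even")
```

Checking `even` against `nat` revisits that same goal two steps down, through `odd`, and it is at that point that the cycle rule applies.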

As we will prove below, the CBN subtyping rules in Figure 5 are exactly 
those for which $t \le u$ in the CBN language if and only if $t^N \le u^N$ in the CBPV 
metalanguage. We thereby show that our polarized subtyping on the image of 
Levy’s CBN translation is sound and complete with respect to Gay and Hole’s 
CBN subtyping. 

Before proceeding to those proofs, it is worth pointing out that many of 
these CBN subtyping rules exactly follow CBPV, with a few notable differences. 
First, the $\oplus\mathrm{SUB}_N$ rule does not permit empty branches that do not occur in the 
supertype. This is because the $\downarrow$ shifts that appear in $(\oplus\{\ell : \tau_\ell\}_{\ell \in L})^N$ prevent 
each branch from being empty: there is no emptiness rule for $\downarrow$ shifts in the 
CBPV subtyping. Second, for this CBN language, only types $t = \&\{\,\}$ are full. In 
particular, a CBN function type $t = t_1 \to t_2$ is never full, even though a CBPV 
function type $s^- = t_1^+ \to s_2^-$ is full if the argument type $t_1^+$ is empty. This stems 
from the $\downarrow$ shift that appears in the argument type in $(\tau \to \sigma)^N = {\downarrow}\tau^N \to \sigma^N$. 
Third, the reader may be surprised by the omission of an emptiness judgment 
for CBN types. The $\bot\mathrm{SUB}_N$ rule mentions the CBN type $t = \oplus\{\,\}$, which looks 
like it ought to be an empty type; the CBPV type $t^+ = \oplus\{\,\}$ is empty, after all. 
However, the CBN translation of $t = \oplus\{\,\}$ is in fact the negative type $t^N = {\uparrow}\oplus\{\,\}$, 
and negative types are never empty. Nevertheless, $t^N = {\uparrow}\oplus\{\,\} \le u^N$ in this case. 

Now we prove that polarized subtyping on the image of Levy’s CBN embedding, 
$(-)^N$, is sound and complete with respect to the CBN subtyping rules of 
Figure 5. The proofs can be found in [49, Appendix L]. 


Theorem 7 (Soundness of Polarized Subtyping, Call-by-Name). 


1. If $t^N$ full, then $t$ full. 
2. If $t^N \le u^N$, then $t \le u$. 


Theorem 8 (Completeness of Polarized Subtyping, Call-by-Name). 


1. If $t$ full, then $t^N$ full. 
2. If $t \le u$, then $t^N \le u^N$. 


8.2 Call-by-Value 


We can play through a similar procedure for Levy’s CBV translation. Consider 
a CBV language with the following types. The language of terms, typing rules, 
and standard dynamics can be found in [49, Appendix M]. 


\[ \tau, \sigma ::= \tau \to \sigma \mid \tau_1 \otimes \tau_2 \mid 1 \mid \oplus\{\ell : \tau_\ell\}_{\ell \in L} \mid \&\{\ell : \sigma_\ell\}_{\ell \in L} \]


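The translations that Levy [54] presents from CBV types and terms to CBPV 
positive types and expressions are as follows. We only present the translation of 
variables, function abstractions, and function applications; the full translation on 
terms can be found in [54]. 


Types 
\[
\begin{array}{lcl}
(\tau \to \sigma)^V & = & {\downarrow}(\tau^V \to {\uparrow}\sigma^V) \\
(\tau_1 \otimes \tau_2)^V & = & \tau_1^V \otimes \tau_2^V \\
(1)^V & = & 1 \\
(\oplus\{\ell : \tau_\ell\}_{\ell \in L})^V & = & \oplus\{\ell : \tau_\ell^V\}_{\ell \in L} \\
(\&\{\ell : \sigma_\ell\}_{\ell \in L})^V & = & {\downarrow}\&\{\ell : {\uparrow}\sigma_\ell^V\}_{\ell \in L}
\end{array}
\]

Terms 
\[
\begin{array}{lcl}
(x)^V & = & \mathsf{return}\ x \\
(f)^V & = & \mathsf{force}\ f \quad \text{for } f : \tau = e \in \Sigma \\
(\lambda x.\,e)^V & = & \mathsf{return}\ (\mathsf{thunk}\ (\lambda x.\,e^V)) \\
(e_1\,e_2)^V & = & \mathsf{let\ return}\ x = e_2^V\ \mathsf{in}\ \mathsf{let\ return}\ f = e_1^V\ \mathsf{in}\ (\mathsf{force}\ f)\ x
\end{array}
\]


We also translate type names $t$ to fresh type names $t^V$, translating the body of 
$t$’s definition and inserting additional type names as required for the normal form 
that alternates between structural types and type names. 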

Levy proves that well-typed terms translate to well-typed expressions. Because 
our syntactic typing rules are the same as his, his theorem carries over. 
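
For comparison with the CBN case, the type part of the CBV translation can be sketched in Python in the same way. The clauses follow our reading of the translation table above: function and lazy record types become thunks of computations that return their results, while the positive connectives translate componentwise. The tuple encoding and the function name `cbv` are assumptions for illustration; type names and the normal-form step are again omitted.

```python
def cbv(ty):
    """Translate a structural CBV type into a CBPV positive type."""
    tag = ty[0]
    if tag == "arrow":      # thunk of a function that returns its result
        return ("down", ("arrow", cbv(ty[1]), ("up", cbv(ty[2]))))
    if tag == "tensor":
        return ("tensor", cbv(ty[1]), cbv(ty[2]))
    if tag == "one":
        return ("one",)
    if tag == "plus":
        return ("plus", {l: cbv(a) for l, a in ty[1].items()})
    if tag == "with":       # thunk of a record whose fields return their results
        return ("down", ("with", {l: ("up", cbv(a)) for l, a in ty[1].items()}))
    raise ValueError(f"unknown type constructor {tag}")

ZERO = ("plus", {})                          # an empty type, written 0 in the text
zero_to_one = cbv(("arrow", ZERO, ("one",)))
```

The usage line translates a function type whose argument type is empty; the result is a down-shifted type, so the outermost constructor of the image is a shift and not a function type.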

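We adapt the CBV subtyping system of Ligatti et al. [55] to our setting, 
which means that we include variants and lazy records with width and depth 
subtyping and replace isorecursive with equirecursive types. We obtain the 
syntactic subtyping rules shown in Figure 6. Once again, we will take care to 
distinguish the CBV syntactic subtyping judgment, $t \le u$, from CBPV syntactic 
subtyping by marking CBPV type names with pluses and minuses. The rules 
shown in Figure 6 build circular derivations. 

\begin{gather*}
\frac{t = t_1 \to t_2 \quad u = u_1 \to u_2 \quad u_1 \le t_1 \quad t_2 \le u_2}{t \le u}\ {\to}\mathrm{SUB}_V
\\[1ex]
\frac{t = t_1 \otimes t_2 \quad u = u_1 \otimes u_2 \quad t_1 \le u_1 \quad t_2 \le u_2}{t \le u}\ {\otimes}\mathrm{SUB}_V
\qquad
\frac{t = 1 \quad u = 1}{t \le u}\ 1\mathrm{SUB}_V
\\[1ex]
\frac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \quad u = \oplus\{j : u_j\}_{j \in J} \quad \forall(\ell \in L \setminus J) : t_\ell\ \mathsf{empty} \quad \forall(\ell \in L \cap J) : t_\ell \le u_\ell}{t \le u}\ {\oplus}\mathrm{SUB}_V
\\[1ex]
\frac{t = \&\{\ell : t_\ell\}_{\ell \in L} \quad u = \&\{j : u_j\}_{j \in J} \quad (L \supseteq J) \quad \forall(j \in J) : t_j \le u_j}{t \le u}\ \&\mathrm{SUB}_V
\\[1ex]
\frac{t\ \mathsf{empty} \quad u = \sigma}{t \le u}\ {\bot}\mathrm{SUB}_V
\qquad
\frac{t = t_1 \to t_2 \quad u = u_1 \to u_2 \quad u_1\ \mathsf{empty}}{t \le u}\ {\top}\mathrm{SUB}_V^{\to\to}
\\[1ex]
\frac{t = \&\{\ell : t_\ell\}_{\ell \in L} \quad u = u_1 \to u_2 \quad u_1\ \mathsf{empty}}{t \le u}\ {\top}\mathrm{SUB}_V^{\&\to}
\qquad
\frac{t = t_1 \to t_2 \quad u = \&\{\,\}}{t \le u}\ {\top}\mathrm{SUB}_V^{\to\&}
\\[1ex]
\frac{t = t_1 \otimes t_2 \quad t_i\ \mathsf{empty}}{t\ \mathsf{empty}}\ {\otimes}\mathrm{EMP}_V
\qquad
\frac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \quad \forall(\ell \in L) : t_\ell\ \mathsf{empty}}{t\ \mathsf{empty}}\ {\oplus}\mathrm{EMP}_V
\end{gather*}

(no emptiness rules for $1$, $\to$, and $\&$) 


Fig. 6. Circular Derivation Rules for Call-by-Value Subtyping 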

These rules match those of Ligatti et al., with one minor exception that we 
will detail below. As we will prove, these rules are exactly those for which $t \le u$ 
in the CBV language if and only if $t^V \le u^V$ in the CBPV metalanguage. 

Before proceeding to the proofs, a few remarks about these rules. First, unlike 
the CBN $\oplus\mathrm{SUB}_N$ rule, the $\oplus\mathrm{SUB}_V$ rule here includes the possibility that some 
components of a variant record type may be empty. More generally, the differences 
between CBN and CBV subtyping arise from the differences in emptiness and 
fullness between the two calculi. Emptiness and fullness are quite sensitive to 
the eager/lazy distinction between the two evaluation strategies. Because this 
distinction manifests in almost every layer of a complex type, the two subtyping 
systems diverge more than one might expect. 

Second, besides the adaptations mentioned above, the rules of Figure 6 diverge 
from those of Ligatti et al. in only one way. Ligatti et al. [55] have the rule 
“$t \le u$ if $u = u_1 \to u_2$ and $u_1$ empty” that generalizes the $\top\mathrm{SUB}_V^{\to\to}$, $\top\mathrm{SUB}_V^{\&\to}$, and 
$\top\mathrm{SUB}_V^{\to\&}$ rules of Figure 6 (assuming that Ligatti et al. would also have “$t \le u$ if 
$u = \&\{\,\}$” if they had included lazy records in their language). 

Somewhat unexpectedly, polarized subtyping on the image of Levy’s CBV 
translation would be incomplete with respect to this more general rule. This is 
because the $\downarrow$ shift inserted by Levy’s translation acts as a barrier to fullness: 
“$t \le u$ if $u = {\downarrow}r$ and $r$ full” would be unsound in polarized subtyping. For 
example, Ligatti et al. have $1 \le 0 \to 1$ for an empty type $0$, but we do not 
have $1^V = 1 \le {\downarrow}(0 \to {\uparrow}1) = (0 \to 1)^V$ because the unit value () does not have 
type ${\downarrow}(0 \to {\uparrow}1)$. This phenomenon is primarily of theoretical interest since it is 
confined to functions that can never be applied to any arguments and empty 
records (and only when they are compared against CBV types $t_1 \otimes t_2$, $1$, and 
$\oplus\{\ell : t_\ell\}_{\ell \in L}$). Nevertheless, we conjecture a more differentiated translation of 
types and terms could restore completeness. 

These observations notwithstanding, we can prove that the CBV subtyping 
rules of Figure 6 are sound and complete with respect to the subtyping rules for 
CBPV under Levy’s translation. The proofs can be found in [49, Appendix M]. 


Theorem 9 (Soundness of Polarized Subtyping, Call-by-Value). 


1. If $t^V$ empty, then $t$ empty. 
2. If $t^V \le u^V$, then $t \le u$. 


Theorem 10 (Completeness of Polarized Subtyping, Call-by-Value). 


1. If $t$ empty, then $t^V$ empty. 
2. If $t \le u$, then $t^V \le u^V$. 


9 Related Work and Discussion 


We now dive deeper into research related to our underlying theme on how 
polarization affects the interaction and definition of subtyping with recursive 
types across varying interpretations. 


Subtyping Recursive Types. The groundwork for coinductive interpretations of 
subtyping equirecursive types has been laid by Amadio and Cardelli [9], subse- 
quently refined by others [13, 37]. Danielsson and Altenkirch [22] also provided 
significant inspiration since they formally clarify that subtyping recursive types 
relies on a mixed induction/coinduction. In using an equirecursive presentation 
within different calculi, our work has been influenced by its predominant use in 
session types [19, 23, 40] and, in particular, Gay and Hole’s coinductive subtyping 
algorithm [39], which we take as a template for call-by-name typing. 

Another important influence has been the work on refinement types [24, 34] 
which are also recursive but exist within predefined universes of generative types. 
As such, subtyping relations are simpler in their interactions, but face many of 
the same issues such as emptiness checking. One can see this paper as an attempt 
to free refinement types from some of their restrictions while retaining some of their 
good properties. The key ingredients are (1) explicitly separating values from 
computations via polarization, (2) the introduction of variant and lazy records 
and their width and depth subtyping rules (owing much to [70]), and (3) simple 
bidirectional typechecking. What is still missing is the use of intersections and 
unions that allow subtyping to propagate more richly to higher-order types [31]. 

Our treatment of empty—value-uninhabited—and full types in Section 4.1, 
as well as our call-by-value interpretation in Section 8.2 builds on Ligatti et al.’s 
work [55] on precise subtyping with isorecursive types. 

Our direct interpretation of isorecursive types and translation into an 
equirecursive setting furthers numerous works either comparing or relating both 
formulations [67, 73, 74]. In particular, Abadi and Fiore [1] and more recently 
Patrignani et al. [63] prove that terms typable in one formulation can be typed 
in the other (and vice versa) with varying approaches. The former treats type 
equality inductively and is focused on syntactic considerations. The latter treats 
type equality coinductively and analyzes types semantically. Neither of these 
handles subtyping or mixed coinductive/inductive types as in our study. 

Finally, Zhou et al. [76] serves as a helpful overview paper on subtyping recur- 
sive types at large and discusses how Ligatti et al.’s complete set of rules requires 
very specific environments for subtyping, as well as non-standard subtyping rules. 
This observation demonstrates why our semantic typing/subtyping approach can 
offer a more flexible abstraction for reasoning about expressive type systems 
while maintaining type safety. 


Semantic Typing and Subtyping. Semantic typing goes back to Milner’s semantic 
soundness theorem [57], which defined a well-typed program as being semantically 
free of a type violation. Whereas syntactic typing specifies a fixed set of syntactic 
rules that safe terms can be constructed from, semantic typing here combines 
two requirements: positive types circumscribe observable values, exposing their 
structure, and computations of negative types are only required to behave in a 
safe way. As we demonstrate throughout Section 5, we can prove our semantic 
definitions compatible with our syntactic type rules, leaving syntactic type 
soundness to fall out easily (Theorem 6). 

Milner’s initial model didn’t scale well to richer types, like recursive types. 
With a lens toward more expressive systems, step indexing has become a prominent 
approach [7, 8, 10, 27], which we use to observe that a computation in our model 
steps according to our dynamics. 

As with syntactic/semantic typing, syntactic subtyping is a more typical 
approach to modeling subtyping relations than its semantic counterpart. 
Nonetheless, in work that has run almost parallel to the research on semantic types, research 
on semantic subtyping has also made strides [35, 15, 66]. Mainly, these exploit 
semantic subtyping for developing type systems based on set-theoretic subtyping 
relations and properties, particularly in the context of handling richer types, 
including polymorphic functions [17, 16, 65] and variants [18], recursive types 
(interpreted coinductively), and union, intersection, and negation connectives [36]. 
A major theme in this line of work is excising “circularity” [15, 36] by means of 
an involved bootstrapping technique, as issues arise when the denotation of a type 
is defined simply as the set of values having that type. 

We depart from this line of research in the treatment of functions (defined 
computationally rather than set-theoretically), recursive types (equirecursive 
setting; inductive for the positive layer and coinductive for the negative layer), 
both variant and lazy record types, and the commitment to explicit polarization 
(including our incorporation of emptiness/fullness). The latter eliminates 
circularity and ties together multiple threads defined in this study. 

With this combination of semantic typing and subtyping, our work provides 
a metatheory for a more interesting set of typed expressions while also providing 
a stronger and more flexible basis for type soundness [28], as semantic typing 
can reason about syntactically ill-typed expressions as long as those expressions 
are semantically well-typed. This combination scales well to our polarized, mixed 
setting and focus on subtyping in the presence of recursive types. 


Polarized Type Theory and Call-by-Push-Value. At the core of this work has 
been the call-by-push-value [53, 54] (CBPV) calculus with its notions of values, 
computations, and the shifts between them. Beyond Levy’s work, this subsuming 
paradigm has formed the foundation of much recent research, ranging from 
probabilistic domains [33] to those reasoning about effects [56] and dependent 
types [64]. New et al.’s [60] gradual typing extension to the calculus shares 
similarities with our use of step indexing, but its relations (binary rather than 
unary), dynamics, and step-counting are treated differently, and its goals are 
very different as well, including no coverage of subtyping. 

To our knowledge, there are no direct treatments of subtyping recursive types 
in a CBPV system or applying a full semantic typing approach in this context 
with subtyping. It is, as we’ve shown, a fruitful setting for our investigation since 
the explicit polarization of the language mirrors the mixed reasoning required to 
analyze the subtyping. 

Though CBPV and polarized type theory typically go hand-in-hand, there 
are investigations that look at polarization (focusing) and algebraic typing and 
subtyping from alternate perspectives. Steffen [72] predates Levy’s research and 
presents polarity as a kinding system for exploiting monotone and antimonotone 
operators in subtyping function application. Abel [2] built upon this and extended 
it with sized types. The inherent connection between types and evaluation strategy 
has also been studied in the setting of program synthesis [71] and proof theory [58], 
but these do not share our specific semantic concerns. 

Polarization as an organizing principle for subtyping is present in Zeilberger’s 
thesis [75], but addresses a problem that is fundamentally different in multiple 
ways, e.g. using “classical” types and continuations, and no width and depth 
subtyping. The biggest difference, however, is that its setting considers refinement 
types, while we do not have a refinement relation and show that some of the 
advantages of refinement types can be achieved without the additional layer. 

Two studies on a global approach to algebraic subtyping [26, 62] define 
subtyping relationships with generative datatype constructors while discussing 
polarity (here with a different meaning) and discarding semantic interpretations. 
However, the generative nature of datatype constructors in this work makes it 
quite different from ours. 

Mixed Coinductive/Inductive Reasoning for Recursive Types. The natural 
separation of positive and negative layers in CBPV led us through the literature on 
mixed coinductive/inductive definitions for recursive types. Related to our work 
in this paper, Danielsson and Altenkirch [22] and Jones and Pearce [46] provide 
definitions for equirecursive subtyping relations in a mixed setting while using a 
suspension monad for non-terminating computations, which shares an affinity 
with force/return CBPV computations. Danielsson and Altenkirch, however, do 
not try to justify the structural typing rules themselves via semantic typing of 
values or expressions—only the subtyping rules. Jones and Pearce are closer to 
our approach since they also use a semantic interpretation of types for expressions. 
While not polarized, they do consider inductive/coinductive types separately, but 
do not lift them to cover function types, instead studying other constructs such 
as unions. 

Komendantsky [48] manages infinitary subtyping (for only function and 
recursive types) via a semantic encoding by folding an inductive relation into 
a coinductive one. We work in the opposite direction, turning the coinductive 
portion into an inductive one by step indexing. Lepigre and Raffalli [52] mix 
induction and coinduction in a syntax-directed framework, focusing on circular 
proof derivations and sized types [6], and also manage inductive types coinductively. 
Cohen and Rowe [21] provide a proposal for circular reasoning in a mixed setting, 
but the focus is on a transitive closure logic built around least and greatest 
fixed point operators. It seems quite plausible that we could use such systems to 
formalize our investigation, although we found some merit in using step-indexing 
and Brotherston and Simpson’s circular proof system for induction [14]. 


10 Conclusion 


We introduced a rich system of subtyping for an equirecursive variant of 
call-by-push-value and proved its soundness via semantic means. We also provided a 
bidirectional type checking algorithm and illustrated its expressiveness through 
several different kinds of examples. We showed the fundamental nature of the 
results by deriving systems of subtyping for isorecursive types and languages 
with call-by-name and call-by-value dynamics. The limitations of the present 
systems lie primarily in the lack of intersection and union types and parametric 
polymorphism, which are the subject of ongoing work. 

Abstract. Algebraic effects offer a versatile framework that covers a 
wide variety of effects. However, the family of operations that delimit 
scopes are not algebraic and are usually modelled as handlers, thus pre- 
venting them from being used freely in conjunction with algebraic oper- 
ations. Although proposals for scoped operations exist, they are either 
ad-hoc and unprincipled, or too inconvenient for practical programming. 
This paper provides the best of both worlds: a theoretically-founded 
model of scoped effects that is convenient for implementation and rea- 
soning. Our new model is based on an adjunction between a locally 
finitely presentable category and a category of functorial algebras. Using 
comparison functors between adjunctions, we show that our new model, 
an existing indexed model, and a third approach that simulates scoped 
operations in terms of algebraic ones have equal expressivity for han- 
dling scoped operations. We consider our new model to be the sweet 
spot between ease of implementation and structuredness. Additionally, 
our approach automatically induces fusion laws of handlers of scoped 
effects, which are useful for reasoning and optimisation. 


Keywords: Computational effects · Category theory · Haskell ·
Algebraic theories · Scoped effects · Handlers · Abstract syntax


1 Introduction 


For a long time, monads have been the go-to approach for purely
functional modelling of and programming with side effects. However, in recent
years an alternative approach, algebraic effects [48], has been gaining traction. A
big breakthrough has been the introduction of handlers [52], which has made
algebraic effects suitable for programming and has led to numerous dedicated 
languages and libraries implementing algebraic effects and handlers. In compar- 
ison to monads, algebraic effects provide a more modular approach to computa- 
tions with effects, in which the syntax and semantics of effects are separated— 
computations invoking algebraic operations can be defined syntactically, and the 
semantics of operations are given by handlers separately in possibly many ways. 

A disadvantage of algebraic effects is that they are less expressive than mon- 
ads; not all effects can be easily expressed or composed within their confines. 
For instance, operations like catch for exception handling, spawn for parallel
composition of processes, or once for restricting nondeterminism are not con- 
ventional algebraic operations; instead they delimit a computation within their 
scope. Such operations are usually modelled as handlers, but the problem is that 
they cannot be freely used amongst other algebraic operations: when a handler 
implementing a scoped operation is applied to a computation, the computation 
is transformed from a syntactic tree of algebraic operations into some semantic 
model implementing the scoped operation. Consequently, all subsequent oper- 
ations on the computation can only be given in the particular semantic model 
rather than as mere syntactic operations, thus nullifying the crucial advantage 
of modularity when separating syntax and semantics of effects. 

To remedy the situation, Wu et al. proposed a practical, but ad-hoc,
generalisation of algebraic effects in Haskell that encompasses scoped effects and
has been adopted by several algebraic effects libraries [82] [42] [56]. More
recently, Piróg et al. sought to put this ad-hoc approach for scoped effects 
on the same formal footing as algebraic effects. Their solution resulted in a 
construction based on a level-indexed category, called indexed algebras, as the 
way to give semantics to scoped effects. However, this formalisation introduces 
a disparity between syntax and semantics that makes indexed algebras not as 
structured as the programs they interpret, where they use an ad-hoc hybrid 
fold that requires indexing for the handlers, but not for the program syntax. 
Moreover, indexed algebras are not ideal for widespread implementation as they 
require dependent typing, in at least a limited form like GADTs [25].

This paper presents a more structured way of handling scoped effects, which 
we call functorial algebras. They are principled and formally grounded on cat- 
egory theory, and at the same time more structured than the indexed algebras 
of Piróg et al. [46], in the sense that the structure of functorial algebras directly 
follows the abstract syntax of programs with scoped effects. Functorial algebras 
enjoy the following advantages over indexed algebras: 


— Functorial algebras admit a simpler interface and implementation 
without requiring dependent types or GADTs. This enables the adoption of 
scoped effects in a wider range of languages. 

— Functorial algebras are easier to reason about due to their structuredness. 
In particular, it allows us to derive a one-pass handle function 
that does not convert syntax to the free functorial algebra. In comparison, 
a similar one-pass recursion scheme is much harder for indexed algebras to 
derive. Although Piróg et al. showed one in their implementation, they did 
not prove its correctness. In this paper, we provide the missing proof by 
converting indexed algebras to functorial ones (Example 12).

— These improvements have not sacrificed expressivity, since translating be-
tween functorial algebras and existing approaches is possible (Section 4).


The structure and contributions of this paper are as follows: 


— We highlight the loss of modularity when modelling scoped operations as
handlers and sketch how the problem is solved using functorial algebras in
Haskell, along with a number of programming examples (Section 2).

— We develop a category-theoretic foundation of functorial algebras as a notion
of handlers of scoped effects. Specifically, we show that there is an adjunc- 
tion between functorial algebras and a base category, inducing the monad 
modelling the syntax of scoped effects (Section 3). 

— We show that the expressivity of functorial algebras, Piróg et al. [46]’s in-
dexed algebras, and simulating scoped effects with algebraic operations and
recursion are equal, by constructing interpretation-preserving functors be- 
tween the three categories of algebras (Section 4). 

— We present the fusion law of functorial algebras, which is useful for reasoning
and optimisation. The fusion law directly follows from the naturality of the
adjunction underlying functorial algebras (Section 5).


Finally, we discuss related work (Section 6) and conclude (Section 7). An ex-
tended version of this paper contains appendices and proofs,
and our implementations can also be found online [71].


2 Scoped Effects for the Working Programmer 


We start with a recap of handlers of algebraic effects (Section 2.1), and then
we highlight the loss of modularity when modelling non-algebraic effectful op-
erations as handlers (Section 2.2). We then show how the problem is solved by
modelling them as scoped operations and handling them with functorial algebras
in Haskell (Section 2.3), whose categorical foundation will be developed later.


2.1 Handlers of Algebraic Effects 


For the purpose of demonstration, in this section we base our discussion on a sim- 
plistic implementation of effect handlers in Haskell using free monads, although 
the problem with effect handlers highlighted in this section applies to other more 
practical implementations of effect handlers, either as libraries (e.g. [27, 33]) or
standalone languages (e.g. [7, 36, 40]).

Following Plotkin and Pretnar, computational effects, such as exceptions,
mutable state, and nondeterminism, are described by signatures of primitive ef- 
fectful operations. Signatures can be abstractly represented by Haskell functors: 


class Functor f where fmap :: (a -> b) -> f a -> f b


The following functor ES (with the evident Functor instance) is the signature
of three operations: throwing an exception, and writing and reading an Int-state:


data ES x = Throw | Put Int x | Get (Int -> x) (1)


Typically, a constructor of a signature functor Σ has a type isomorphic to P ->
(R -> x) -> Σ x for some types P and R. As in (1), the types of the three
constructors are isomorphic to Throw :: () -> (Void -> x) -> ES x, Put :: Int ->
(() -> x) -> ES x and Get :: () -> (Int -> x) -> ES x respectively, where Void is
the empty type. Each constructor of a signature functor Σ is thought of as an
operation that takes a parameter of type P and produces a result of type R, or
equivalently, has R-many possible ways to continue the computation after the
operation. Given any (signature) functor Σ, computations invoking operations
from Σ are modelled by the following datatype, called the free monad of Σ,


data Free Σ a = Return a | Call (Σ (Free Σ a))


whose first case represents a computation that just returns a value, and the
second case represents a computation calling an operation from Σ with more
Free Σ a subterms as arguments, which are understood as the continuation of
the computation after this call, depending on the outcome of this operation.
The inductive datatype Free Σ a comes with a recursion principle:


handle :: (Σ b -> b) -> (a -> b) -> Free Σ a -> b
handle alg g (Return x) = g x
handle alg g (Call op) = alg (fmap (handle alg g) op)


which folds a tree of operations Free Σ a into a type b, given a function
Σ b -> b, usually called a Σ-algebra, that performs operations from Σ on b, and
a function a -> b that transforms the returned type a of computations to b. The
function handle can be used to give Free Σ a monad instance:


return :: a -> Free Σ a
return = Return

(>>=) :: Free Σ a -> (a -> Free Σ b) -> Free Σ b
m >>= k = handle Call k m


The monadic instance allows the programmer to build effectful computations
using the do-notation in a clean way. For example, the following program updates
the state s to n / s (integer division) for some n :: Int, and throws an exception
when s is 0:


safeDiv :: Int -> Free ES Int
safeDiv n = do s <- get; if s == 0 then Call Throw
               else do { put (n `div` s); return (n `div` s) }


where the auxiliary wrapper functions (the so-called smart constructors in the 
Haskell community) that invoke Call appropriately are 


get = Call (Get Return) put n = Call (Put n (Return ())) 


The free monad merely models effectful computations syntactically without 
specifying how these operations are actually implemented. Indeed, the program 
safeDiv above is defined without saying how mutable state and exceptions are 
implemented at all. To actually give useful semantics to programs built with free 
monads, the programmer uses the handle function above to interpret programs
with Σ-algebras, which are called handlers in this context.

For example, given a program r :: Free ES a for some a, a handler catchHdl r ::
ES (Free ES a) -> Free ES a that gives the usual semantics to throw is


catchHdl :: Free ES a -> ES (Free ES a) -> Free ES a (2)
catchHdl r Throw = r; catchHdl r op = Call op


which evaluates r for recovery in case of throwing an exception, and leaves
other operations untouched in the free monad. An important advantage of the 
approach of effect handlers is that different semantics of a computational effect 
can be given by different handlers. For example, suppose that in some scenario 
one would like to interpret exceptions as unrecoverable errors and stop the exe- 
cution of the program when an exception is raised. Then the following handler 
can be defined for this behaviour: 


catchHdl' :: Free ES a -> ES (Free ES (Maybe a)) -> Free ES (Maybe a) (3)
catchHdl' r Throw = return Nothing; catchHdl' r op = Call op


As expected, applying these two handlers to the program safeDiv 5 produces
different results (of types Free ES Int and Free ES (Maybe Int) respectively):


handle (catchHdl (return 42)) return (safeDiv 5)
= do s <- get; if s == 0 then return 42
     else do { put (5 `div` s); return (5 `div` s) }


handle (catchHdl' (return 42)) (return . Just) (safeDiv 5)
= do s <- get; if s == 0 then return Nothing
     else do { put (5 `div` s); return (Just (5 `div` s)) }


Note that exception throwing and catching are modelled differently in the ap- 
proach of algebraic effects and handlers, one as an operation in the signature 
ES and one as a handler, although it is natural to expect both of them to 
be operations of the effect of exceptions. This asymmetry results from the fact 
that exception catching is not algebraic: if catch were modelled as a binary op-
eration in the signature, then the monadic bind >>= of the free monad earlier,
which intuitively means sequential composition of programs, would imply that
(catch r p) >>= k = catch (r >>= k) (p >>= k), which is semantically undesirable.
Thus the perspective of Plotkin and Pretnar is that non-algebraic operations 
like catch should be deemed different from algebraic operations, and they can 
be modelled as handlers (of algebraic operations). 
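
The definitions of this subsection can be collected into one runnable sketch. The type variable sig stands in for the signature functor, integer division is written with div, and runES is a hypothetical runner added here only to observe results; it is an assumption of this sketch and not part of the development above.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Signature of exceptions and an Int state, as in (1).
data ES x = Throw | Put Int x | Get (Int -> x)
  deriving Functor

-- Free monad over a signature functor; sig is a stand-in name.
data Free sig a = Return a | Call (sig (Free sig a))

-- Recursion principle: fold a syntax tree with an algebra and a value case.
handle :: Functor sig => (sig b -> b) -> (a -> b) -> Free sig a -> b
handle _   g (Return x) = g x
handle alg g (Call op)  = alg (fmap (handle alg g) op)

instance Functor sig => Functor (Free sig) where
  fmap f = handle Call (Return . f)

instance Functor sig => Applicative (Free sig) where
  pure = Return
  mf <*> mx = handle Call (\f -> fmap f mx) mf

instance Functor sig => Monad (Free sig) where
  m >>= k = handle Call k m

-- Smart constructors.
get :: Free ES Int
get = Call (Get Return)

put :: Int -> Free ES ()
put n = Call (Put n (Return ()))

safeDiv :: Int -> Free ES Int
safeDiv n = do
  s <- get
  if s == 0
    then Call Throw
    else do { put (n `div` s); return (n `div` s) }

-- Handler (2): recover from Throw by running r.
catchHdl :: Free ES a -> ES (Free ES a) -> Free ES a
catchHdl r Throw = r
catchHdl _ op    = Call op

-- Handler (3): turn Throw into Nothing; the recovery program is ignored.
catchHdl' :: Free ES a -> ES (Free ES (Maybe a)) -> Free ES (Maybe a)
catchHdl' _ Throw = return Nothing
catchHdl' _ op    = Call op

-- Hypothetical runner (an assumption of this sketch): threads an initial
-- state and yields Nothing if a Throw is left unhandled.
runES :: Int -> Free ES a -> Maybe (Int, a)
runES s (Return x)        = Just (s, x)
runES _ (Call Throw)      = Nothing
runES _ (Call (Put s' k)) = runES s' k
runES s (Call (Get k))    = runES s (k s)
```

With initial state 0, runES 0 (handle (catchHdl (return 42)) return (safeDiv 5)) evaluates to Just (0, 42), whereas the same program handled with catchHdl' and (return . Just) evaluates to Just (0, Nothing). The two handlers thus place the result in different types, which is the point taken up in the next subsection.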


2.2 Scoped Operations as Handlers Are Not Modular 


However, this treatment of non-algebraic operations leads to a somewhat subtle 
complication: as observed by Wu et al. [70], when non-algebraic operations (such
as catch) are modelled with handlers, these handlers play a dual role of (i) mod- 
elling the syntax of the operation (the scope for which exceptions are caught 
by catch) and (ii) giving semantics to it (when an exception is caught, run the 
recovery program). To see the problem more concretely, ideally one would like to 
have a syntactic operation catch of the following type that acts on computations 
without giving specific semantics a priori, 


catch :: Free ES a -> Free ES a -> Free ES a


allowing us to write programs like


prog = do { x <- catch (safeDiv 5) (return 42); put (x + 1) } (4)


and the semantics of (both algebraic and non-algebraic) operations in prog can be
given separately by handlers. Unfortunately, when catch is modelled as handlers 
catchHdl or catchHdl’ as in the last subsection, the program prog must be written 
differently depending on which handler is used: 


do x <- handle (catchHdl (return 42)) return (safeDiv 5); put (x + 1)


vs. do xMb <- handle (catchHdl' (return 42)) (return . Just) (safeDiv 5)
       case xMb of { Nothing -> return Nothing
                   ; (Just x) -> do r <- put (x + 1); return (Just r) }


The issue is that these handlers interpret the operation catch in different seman- 
tic models, Free ES a and Free ES (Maybe a), and this affects both the value x 
that is returned, and the way the subsequent put is expressed. Therefore, the
non-algebraic operation catch, when modelled as a handler, is not as modular as
algebraic operations, weakening the advantage of programming with algebraic effects.


2.3 Scoped Effects and Functorial Algebras 


Now we present an overview of a solution to the problem highlighted above:
modelling exception catching as a scoped effect and handling it with
functorial algebras, which will be more formally developed in later sections.


Syntax of Scoped Operations To achieve modularity for (non-algebraic) opera-
tions delimiting scopes, such as catch, which are called scoped operations, Piróg
et al. generalise the free monad Free Σ to a monad Prog Σ Γ accommo-
dating both algebraic and scoped operations. The monad is parameterised by
two functors Σ and Γ, called the algebraic signature and the scoped signature
respectively. The intention is that a constructor Op :: (R -> x) -> Σ x of the
algebraic signature represents an algebraic operation Op producing an R-value
as usual, whereas a constructor Sc :: (N -> x) -> Γ x of the scoped signature
represents a scoped operation Sc creating N-many scopes enclosing programs.


Example 1. As in the previous subsection, the effect of exceptions has an alge- 
braic operation for throwing exceptions, which produces no values, and a scoped 
operation for catching exceptions, which creates two scopes, one enclosing the 
program for which exceptions are caught, and the other enclosing the recovery 
computation. Thus the algebraic and scoped signatures are respectively 


data Throw x = Throw          data Catch x = Catch x x (5)


Example 2. An effect of explicit nondeterminism has two algebraic operations 
for nondeterministic choice and a scoped operation Once: 


data Choice x = Fail | Or x x          data Once x = Once x (6)


The intention is that this effect implements logic programming [20]: solutions
to a problem are exhaustively searched: operation Or p q splits a search branch 
into two; Fail marks a failed branch; and the scoped operation Once p keeps 
only the first solution found by p, making it semi-deterministic, which is useful 
for speeding up the search with heuristics from the programmer. 
Similar to the free monad, the Prog monad models the syntax of computa-
tions invoking operations from Σ and Γ:


data Prog Σ Γ a = Return a | Call (Σ (Prog Σ Γ a)) (7)
                | Enter (Γ (Prog Σ Γ (Prog Σ Γ a)))


Thus an element of Prog Σ Γ a can either (i) return an a-value without
causing effects, or (ii) call an algebraic operation in Σ with more subterms
of Prog Σ Γ a as the continuation after the operation, or (iii) enter the scope of
a scoped operation. The third case deserves more explanation: the first Prog in
(Γ (Prog Σ Γ (Prog Σ Γ a))) represents the programs enclosed by the scoped
operation, and the second Prog represents the continuation of the program after
the scoped operation, and thus the boundary between programs inside and out-
side the scope is kept in the syntax tree, which is necessary because collapsing
the boundary might change the meaning of a program. The distinction between
algebraic and scoped operations can be seen more clearly from the monadic bind
of Prog (the monadic return of Prog is just Return):


(>>=) :: Prog Σ Γ a -> (a -> Prog Σ Γ b) -> Prog Σ Γ b
(Return a) >>= k = k a
(Call op) >>= k = Call (fmap (>>= k) op)
(Enter sc) >>= k = Enter (fmap (fmap (>>= k)) sc)


For algebraic operations, extending the continuation (>>= k) directly acts on the
argument to the algebraic operation, whereas for scoped operations, (>>= k) acts
on the second layer of Prog. Thus for an algebraic operation o, (o p) >>= k and
o (p >>= k) have the same representation, whereas for a scoped operation s,
(s p) >>= k and s (p >>= k) have different representations, which is precisely the
distinction between algebraic and scoped operations.

The constructors Call and Enter are clumsy to work with, and for writing
programs more naturally, we define smart constructors for operations. Generally,
for algebraic operations Op :: F x -> Σ x and scoped operations Sc :: G x -> Γ x,
the smart constructors are


op :: F (Prog Σ Γ a) -> Prog Σ Γ a
op = Call . Op

sc :: G (Prog Σ Γ a) -> Prog Σ Γ a
sc = Enter . fmap (fmap return) . Sc


For example, the smart constructor for Catch (Example 1) is


catch :: Prog Σ Catch a -> Prog Σ Catch a -> Prog Σ Catch a
catch h r = Enter (Catch (fmap return h) (fmap return r))


With all machinery in place, now we can define the program using Prog that
we could not write with Free:


prog = do { x <- catch (safeDiv 5) (return 42); put (x + 1) }
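
To see concretely that the syntax keeps the scope boundary, the following sketch implements the Prog datatype with its monad instance for the exception signatures of Example 1. The names sig and gam stand in for the algebraic and scoped signatures. The function run is a hypothetical direct interpreter written only for this illustration; it is not the handler mechanism introduced next.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Syntax of programs with algebraic (sig) and scoped (gam) operations.
data Prog sig gam a
  = Return a
  | Call (sig (Prog sig gam a))
  | Enter (gam (Prog sig gam (Prog sig gam a)))

instance (Functor sig, Functor gam) => Functor (Prog sig gam) where
  fmap f (Return x) = Return (f x)
  fmap f (Call op)  = Call (fmap (fmap f) op)
  fmap f (Enter sc) = Enter (fmap (fmap (fmap f)) sc)

instance (Functor sig, Functor gam) => Applicative (Prog sig gam) where
  pure = Return
  pf <*> px = pf >>= \f -> fmap f px

instance (Functor sig, Functor gam) => Monad (Prog sig gam) where
  Return x >>= k = k x
  Call op  >>= k = Call (fmap (>>= k) op)          -- extends the argument
  Enter sc >>= k = Enter (fmap (fmap (>>= k)) sc)  -- extends the second layer

-- Signatures of Example 1.
data Throw x = Throw     deriving Functor
data Catch x = Catch x x deriving Functor

throw :: Prog Throw Catch a
throw = Call Throw

catch :: Prog Throw Catch a -> Prog Throw Catch a -> Prog Throw Catch a
catch h r = Enter (Catch (fmap return h) (fmap return r))

-- Hypothetical direct interpreter (an assumption of this sketch): run the
-- scoped program; on failure run the recovery; then run the continuation.
run :: Prog Throw Catch a -> Maybe a
run (Return x)          = Just x
run (Call Throw)        = Nothing
run (Enter (Catch h r)) = case run h of
  Just k  -> run k
  Nothing -> run r >>= run

-- A continuation that throws when it receives 1.
k0 :: Int -> Prog Throw Catch Int
k0 x = if x == 1 then throw else return x

outside, inside :: Prog Throw Catch Int
outside = catch (return 1) (return 2) >>= k0          -- k0 runs after the scope
inside  = catch (return 1 >>= k0) (return 2 >>= k0)   -- k0 runs inside the scope
```

Here run outside is Nothing, because the exception is raised after the scope has been left, while run inside is Just 2, because the exception is raised inside the scope and the recovery branch takes over. The two programs differ already as syntax trees, which is what the Enter constructor preserves.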


Handlers of Scoped Operations Similar to Free, the Prog monad merely models
the syntax of effectful computations, and more useful semantics need to be given
by handlers. Although Piróg et al. developed a notion of indexed algebras for
this purpose, indexed algebras turn out to be more complicated than necessary
(we will discuss them in Section 4), and the contribution of this paper is a simpler
kind of handlers for scoped operations, which we call functorial algebras.


data EndoAlg Σ Γ f = EndoAlg {
  returnE :: forall x. x -> f x,
  callE :: forall x. Σ (f x) -> f x,
  enterE :: forall x. Γ (f (f x)) -> f x }

data BaseAlg Σ Γ f a = BaseAlg {
  callB :: Σ a -> a,
  enterB :: Γ (f a) -> a }

hcata :: (Functor Σ, Functor Γ) => (EndoAlg Σ Γ f) -> Prog Σ Γ a -> f a
hcata alg (Return x) = returnE alg x
hcata alg (Call op) = (callE alg . fmap (hcata alg)) op
hcata alg (Enter scope) = (enterE alg . fmap (hcata alg . fmap (hcata alg))) scope

handle :: (Functor Σ, Functor Γ)
  => (EndoAlg Σ Γ f) -> (BaseAlg Σ Γ f b) -> (a -> b) -> Prog Σ Γ a -> b
handle ealg balg gen (Return x) = gen x
handle ealg balg gen (Call op) = (callB balg . fmap (handle ealg balg gen)) op
handle ealg balg gen (Enter sc)
  = (enterB balg . fmap (hcata ealg . fmap (handle ealg balg gen))) sc


Fig. 1: A Haskell implementation of handling with functorial algebras

Given signatures Σ and Γ, a functorial algebra for them is a quadruple
(f, b, ealg, balg) for some functor f, called the endofunctor carrier, and some
type b, called the base carrier. The other two components ealg :: EndoAlg Σ Γ f
and balg :: BaseAlg Σ Γ f b are called the endofunctor algebra and the base
algebra. Their types are fully shown in Figure 1. The intuition is that functor f
and ealg interpret the part of a program enclosed by scoped operations, and the
type b and balg interpret the part of a program not enclosed by any scopes.


Example 3. The standard semantics of exception catching (cf. handler (2)) can
be implemented by a functorial algebra with the conventional Maybe functor as
the endofunctor carrier with the following EndoAlg:


excE :: EndoAlg Throw Catch Maybe
excE = EndoAlg {..} where
  returnE = Just
  callE Throw = Nothing
  enterE :: Catch (Maybe (Maybe a)) -> Maybe a
  enterE (Catch Nothing r) = join r
  enterE (Catch (Just k) _) = k


For the base carrier that interprets operations not enclosed by any catch, a 
straightforward choice is just taking Maybe a as the base carrier for a type a, 
and setting callB = callE and enterB = enterE, which means that operations 
inside and outside scopes are interpreted in the same way. 

In general, we can define a specialised version of handle that only 
takes an endofunctor algebra as input for interpreting operations inside and 
outside scopes in the same way:


handleE :: (EndoAlg Σ Γ f) -> Prog Σ Γ a -> f a
handleE ealg@(EndoAlg {..}) = handle ealg (BaseAlg callE enterE) returnE


Applying handleE excE to the following program produces Just 43 as expected.

do { x <- catch throw (return 42); return (x + 1) } (8)


For the non-standard semantics (cf. handler (3)) that disables exception recovery,
one can define another endofunctor algebra excE' by replacing enterE in excE with


enterE' :: Catch (Maybe (Maybe a)) -> Maybe a
enterE' (Catch Nothing _) = Nothing; enterE' (Catch (Just k) _) = k


With excE', handling the program in (8) produces Nothing as expected.
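
The two semantics of exception catching can be checked with a self-contained transcription of Fig. 1. The sketch below assumes GHC with the RankNTypes extension, writes sig and gam for the two signatures, and spells out the record fields with lambdas instead of RecordWildCards; these choices are conveniences of the sketch rather than requirements.

```haskell
{-# LANGUAGE RankNTypes, DeriveFunctor #-}
import Control.Monad (join)

data Prog sig gam a
  = Return a
  | Call (sig (Prog sig gam a))
  | Enter (gam (Prog sig gam (Prog sig gam a)))

instance (Functor sig, Functor gam) => Functor (Prog sig gam) where
  fmap f (Return x) = Return (f x)
  fmap f (Call op)  = Call (fmap (fmap f) op)
  fmap f (Enter sc) = Enter (fmap (fmap (fmap f)) sc)

instance (Functor sig, Functor gam) => Applicative (Prog sig gam) where
  pure = Return
  pf <*> px = pf >>= \f -> fmap f px

instance (Functor sig, Functor gam) => Monad (Prog sig gam) where
  Return x >>= k = k x
  Call op  >>= k = Call (fmap (>>= k) op)
  Enter sc >>= k = Enter (fmap (fmap (>>= k)) sc)

-- Functorial algebras, following Fig. 1.
data EndoAlg sig gam f = EndoAlg
  { returnE :: forall x. x -> f x
  , callE   :: forall x. sig (f x) -> f x
  , enterE  :: forall x. gam (f (f x)) -> f x }

data BaseAlg sig gam f b = BaseAlg
  { callB  :: sig b -> b
  , enterB :: gam (f b) -> b }

hcata :: (Functor sig, Functor gam) => EndoAlg sig gam f -> Prog sig gam a -> f a
hcata alg (Return x) = returnE alg x
hcata alg (Call op)  = callE alg (fmap (hcata alg) op)
hcata alg (Enter sc) = enterE alg (fmap (hcata alg . fmap (hcata alg)) sc)

handle :: (Functor sig, Functor gam)
       => EndoAlg sig gam f -> BaseAlg sig gam f b -> (a -> b) -> Prog sig gam a -> b
handle _    _    gen (Return x) = gen x
handle ealg balg gen (Call op)  = callB balg (fmap (handle ealg balg gen) op)
handle ealg balg gen (Enter sc) =
  enterB balg (fmap (hcata ealg . fmap (handle ealg balg gen)) sc)

-- Interpret operations inside and outside scopes in the same way.
handleE :: (Functor sig, Functor gam) => EndoAlg sig gam f -> Prog sig gam a -> f a
handleE ealg = handle ealg (BaseAlg (callE ealg) (enterE ealg)) (returnE ealg)

-- Exceptions (Examples 1 and 3).
data Throw x = Throw     deriving Functor
data Catch x = Catch x x deriving Functor

throw :: Prog Throw Catch a
throw = Call Throw

catch :: Prog Throw Catch a -> Prog Throw Catch a -> Prog Throw Catch a
catch h r = Enter (Catch (fmap return h) (fmap return r))

excE, excE' :: EndoAlg Throw Catch Maybe
excE = EndoAlg
  { returnE = Just
  , callE   = \Throw -> Nothing
  , enterE  = \c -> case c of
      Catch Nothing r  -> join r      -- run the recovery branch
      Catch (Just k) _ -> k }
excE' = EndoAlg                       -- same as excE, but never recovers
  { returnE = Just
  , callE   = \Throw -> Nothing
  , enterE  = \c -> case c of
      Catch Nothing _  -> Nothing
      Catch (Just k) _ -> k }

-- Program (8).
prog8 :: Prog Throw Catch Int
prog8 = do { x <- catch throw (return 42); return (x + 1) }
```

Evaluating handleE excE prog8 gives Just 43 and handleE excE' prog8 gives Nothing, in agreement with the text. The program prog8 itself is written once and does not mention either semantics.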


Now we provide some intuition for how functorial algebras work. First note
that the three fields of EndoAlg in Figure 1 precisely correspond to the three
cases of Prog (7). Thus by replacing the constructors of Prog with the cor-
responding fields of EndoAlg, we have a polymorphic function hcata ealg ::
forall x. Prog Σ Γ x -> f x (Figure 1) turning a program into a value in f.

The function handle (Figure 1) takes a functorial algebra, a function gen ::
a -> b and a program p as arguments, and it handles all the effectful operations in
p by using hcata ealg for interpreting the part of p inside scoped operations and 
balg for interpreting the outermost layer of p outside any scoped operations. The 
function gen corresponds to the ‘value case’ of handlers of algebraic effects, which 
transforms the a-value returned by a program into the type b for interpretation. 

We close this section with some more examples of handling scoped effects 
with functorial algebras. The supplementary material of this paper also contains 
an OCaml implementation of functorial algebras and the following examples. 


Example 4. The standard way to handle explicit nondeterminism with the semi-
deterministic operator once (Example 2) is using a functorial algebra with the
list functor as the endofunctor carrier together with the following algebra:


ndetE :: EndoAlg Choice Once []
ndetE = EndoAlg {..} where
  callE :: Choice [a] -> [a]
  callE Fail = []
  callE (Or x y) = x ++ y
  enterE :: Once [[a]] -> [a]
  enterE (Once x) = if null x then [] else head x
  returnE :: a -> [a]
  returnE x = [x]


Then applying handleE ndetE to the following program produces [1,2] as ex-
pected. In comparison, if once were algebraic, the result would be [1].


do { n <- once (or (return 1) (return 3)); or (return n) (return (n + 1)) }
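
A runnable sketch of this example follows. It assumes RankNTypes, writes sig and gam for the signatures, and hides the Prelude's or so that the smart constructor can keep its name. As a simplification, handleE is defined here as hcata: when the base algebra reuses the fields of the endofunctor algebra and gen is returnE, the defining equations of handle unfold to those of hcata.

```haskell
{-# LANGUAGE RankNTypes, DeriveFunctor #-}
import Prelude hiding (or)

data Prog sig gam a
  = Return a
  | Call (sig (Prog sig gam a))
  | Enter (gam (Prog sig gam (Prog sig gam a)))

instance (Functor sig, Functor gam) => Functor (Prog sig gam) where
  fmap f (Return x) = Return (f x)
  fmap f (Call op)  = Call (fmap (fmap f) op)
  fmap f (Enter sc) = Enter (fmap (fmap (fmap f)) sc)

instance (Functor sig, Functor gam) => Applicative (Prog sig gam) where
  pure = Return
  pf <*> px = pf >>= \f -> fmap f px

instance (Functor sig, Functor gam) => Monad (Prog sig gam) where
  Return x >>= k = k x
  Call op  >>= k = Call (fmap (>>= k) op)
  Enter sc >>= k = Enter (fmap (fmap (>>= k)) sc)

data EndoAlg sig gam f = EndoAlg
  { returnE :: forall x. x -> f x
  , callE   :: forall x. sig (f x) -> f x
  , enterE  :: forall x. gam (f (f x)) -> f x }

hcata :: (Functor sig, Functor gam) => EndoAlg sig gam f -> Prog sig gam a -> f a
hcata alg (Return x) = returnE alg x
hcata alg (Call op)  = callE alg (fmap (hcata alg) op)
hcata alg (Enter sc) = enterE alg (fmap (hcata alg . fmap (hcata alg)) sc)

-- Simplification of this sketch: same interpretation inside and outside scopes.
handleE :: (Functor sig, Functor gam) => EndoAlg sig gam f -> Prog sig gam a -> f a
handleE = hcata

-- Signatures of Example 2.
data Choice x = Fail | Or x x deriving Functor
data Once x   = Once x        deriving Functor

or :: Prog Choice Once a -> Prog Choice Once a -> Prog Choice Once a
or p q = Call (Or p q)

once :: Prog Choice Once a -> Prog Choice Once a
once p = Enter (Once (fmap return p))

-- Lists as the endofunctor carrier: depth-first collection of all solutions.
ndetE :: EndoAlg Choice Once []
ndetE = EndoAlg
  { returnE = \x -> [x]
  , callE   = \op -> case op of { Fail -> []; Or x y -> x ++ y }
  , enterE  = \(Once xss) -> if null xss then [] else head xss }

withOnce, withoutOnce :: Prog Choice Once Int
withOnce    = do { n <- once (or (return 1) (return 3)); or (return n) (return (n + 1)) }
withoutOnce = do { n <- or (return 1) (return 3); or (return n) (return (n + 1)) }
```

Here handleE ndetE withOnce is [1,2], whereas handleE ndetE withoutOnce is [1,2,3,4]: once discards the second branch of the inner choice but leaves the later choice in the continuation untouched.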


Example 5. In the last example we used the list functor to interpret explicit 
nondeterminism, resulting in the depth-first search (DFS) strategy for searching. 
As noted by Spivey [59], other search strategies can be implemented by other choices
of functors. For example, depth-bounded search (DBS) can be implemented with
the functor Int -> [a], and breadth-first search (BFS) can be implemented with
the functor [[a]] (or Kidney and Wu [31]’s more efficient LevelT functor).

A powerful application of scoped effects is modelling search strategies: 


data Strategy x = DFS x | BFS x | DBS Int x


so that the programmer can freely specify the search strategy of nondetermin- 
istic choices in a scope. The algebraic signature Choice and scoped signature 
Strategy can be handled by a functorial algebra carried by the endofunctor
([a], [[a]], Int -> [a]) and a base type [a] (assuming that depth-first search
is the default strategy). The complete code is in the supplementary material.


Example 6. A scoped operation for the effect of mutable state is the operation 
local s p that executes the program p with a state s and restores to the original 
state after p finishes. Thus (local s p) >>= k is different from local s (p >>= k), and
local should be modelled as a scoped operation of signature data Local s a =
Local s a. Together with the usual algebraic operations get and put of state,
Local can be interpreted with a functorial algebra carried by the state monad
type State s a = s -> (s, a). The essential part of the functorial algebra is the
following enterE for Local (complete code in the supplementary material):


enterE :: Local s (State s (State s a)) -> State s a
enterE (Local s' f) s = let (_, k) = f s' in k s

Example 7. Parallel composition of processes is not an operation in the usual 
algebraic presentations of process calculi precisely because it is not alge-
braic: (p | q) >>= k ≠ (p >>= k) | (q >>= k). Again, we can model it as a scoped
operation, and different scheduling behaviours of processes can be given as dif- 
ferent functorial algebras. The supplementary material contains complete code 
of handling parallel composition using the so-called resumption monad.


3 Categorical Foundations for Scoped Operations 


We now move on to a categorical foundation for scoped effects and functorial 
algebras. First, we recall some standard category theory underlying algebraic 
effects and handlers and also Piróg et al. [46]’s monad P that 
models the syntax of scoped operations, which is exactly the Prog monad in 
the Haskell implementation (Section 3.2). Then, we define functorial algebras 
formally and show that there is an adjunction between the category 
of functorial algebras and the base category inducing the monad 
P, which provides a means to interpret the syntax of scoped operations. 

The rest of this paper assumes familiarity with basic category theory, such 
as adjunctions, monads, and initial algebras, which are covered by standard 
texts [6, 41, 55]. The mathematical notation in this paper is summarised in the
appendices, which may be consulted if the meaning of some symbols is unclear.


3.1 Syntax and Semantics of Algebraic Operations


The relationships between equational theories, Lawvere theories, monads, and
computational effects have been well studied for decades from many perspectives
[30, 45, 48, 54, 57]. Here we recap a simplified version of equational theories by
Kelly and Power that we follow to model algebraic and scoped effects on
locally finitely presentable (lfp) categories [1].


Locally Finitely Presentable Categories The use of lfp categories in this paper 
is limited to some standard results about the existence of many initial algebras 
in lfp categories, and thus a reader not familiar with lfp categories may follow 
this paper with some simple intuition: a category C is lfp if it has all (small) 
colimits and a set of finitely presentable objects such that every object in C can be 
obtained by ‘glueing’ (formally, as filtered colimits of) some finitely presentable 
objects. For example, Set is lfp with finite sets as its finitely presentable objects, 
and indeed every set can be obtained by glueing, here meaning taking the union 
of, all its finite subsets: X = ⋃ {N ⊆ X | N finite}. Other examples of lfp
categories include the category of partially ordered sets, the category of graphs, 
the category of small categories, and presheaf categories (we refer the reader 
to the excellent exposition for concrete examples); lfp categories are thus
widespread enough to cover many semantic settings of programming languages.

Moreover, an endofunctor F : C → C is said to be finitary if it preserves
‘glueing’ (filtered colimits), which implies that its values FX are determined
by its values at finitely presentable objects: FX ≅ F(colim_i N_i) ≅ colim_i F N_i
where N_i are the finitely presentable objects that generate X when glued to-
gether. For example, polynomial functors ∐_{n ∈ ℕ} P_n × (−)^n on Set are finitary
where P_n is a set for every n.


Algebraic Operations on LFP Categories Fixing an lfp category C, we take fini-
tary endofunctors Σ : C → C as signatures of operations on C. As in Section 2.1,
the intuition is that every natural transformation ∐_{C(R,−)} P → Σ−
for some object P : C and a finitely presentable object R : C stands for an
operation taking a parameter of type P and R-many arguments. The category
Σ-Alg of Σ-algebras is defined as usual: it has pairs (X : C, α : ΣX → X)
as objects and morphisms h : X → X′ such that h · α = α′ · Σh as morphisms
(X, α) → (X′, α′). The following classical results give sufficient
conditions for constructing initial and free Σ-algebras:


Lemma 1. If category C has finite coproducts and colimits of all ω-chains and
functor Σ : C → C preserves them, then the forgetful functor U_Σ : Σ-Alg → C
forgetting the structure maps has a left adjoint Free_Σ : C → Σ-Alg mapping
every X : C to a Σ-algebra (Σ*X, op_X) where Σ*X denotes the initial algebra
μY. X + ΣY and op_X : ΣΣ*X → Σ*X.


Lemma 1 is applicable to our setting since C being lfp directly implies that
it has all colimits, and finitary functors Σ preserve colimits of ω-chains because
colimits of ω-chains are filtered. Hence we have an adjunction Free_Σ ⊣ U_Σ :
Σ-Alg → C. We denote the monad from the adjunction by Σ* = U_Σ Free_Σ
(which is implemented as the Free Σ monad in Section 2.1). The idea is still
that syntactic terms built from operations in Σ are modelled by the monad
Σ*, and semantics of operations are given by Σ-algebras. Given any Σ-algebra
(X, α : ΣX → X) and morphism g : A → X in C, they induce an interpretation
morphism handle_{(X,α)} g : Σ*A → X s.t.


handle_{(X,α)} g = U_Σ (ε_{(X,α)} · Free_Σ g) : Σ*A = U_Σ Free_Σ A → X (9)

where ε_{(X,α)} : Free_Σ U_Σ (X, α) → (X, α) is the counit of Free_Σ ⊣ U_Σ.
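
In Haskell terms, equation (9) says that handling factors into a map over the returned values followed by an evaluation of the term in the algebra. The sketch below makes this reading explicit for the free monad; eval plays the role of the counit component, and the signature Plus and the algebra sumAlg are hypothetical examples chosen only for illustration.

```haskell
-- Free monad over a signature functor; sig is a stand-in name.
data Free sig a = Return a | Call (sig (Free sig a))

-- Action of the free functor on morphisms: relabel the leaves.
instance Functor sig => Functor (Free sig) where
  fmap f (Return x) = Return (f x)
  fmap f (Call op)  = Call (fmap (fmap f) op)

-- Counit component at an algebra alg on x: evaluate a term over x in x.
eval :: Functor sig => (sig x -> x) -> Free sig x -> x
eval _   (Return x) = x
eval alg (Call op)  = alg (fmap (eval alg) op)

-- Equation (9): first map g over the leaves, then evaluate.
handle :: Functor sig => (sig x -> x) -> (a -> x) -> Free sig a -> x
handle alg g = eval alg . fmap g

-- Hypothetical example: one binary operation, interpreted as addition.
data Plus x = Plus x x

instance Functor Plus where
  fmap f (Plus a b) = Plus (f a) (f b)

sumAlg :: Plus Int -> Int
sumAlg (Plus a b) = a + b

term :: Free Plus String
term = Call (Plus (Return "ab") (Call (Plus (Return "cde") (Return ""))))
```

For instance, handle sumAlg length term first replaces the three leaves by their lengths 2, 3 and 0 and then adds them, giving 5. A one-pass definition such as the handle function of Section 2.1 computes the same result without building the intermediate tree.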


Algebraic Effects and Handlers The perspective of Plotkin and Pretnar is
that computational effects are characterised by signatures Σ of primitive effectful
operations, and they determine monads Σ* that model programs syntactically.
Then Σ-algebras are handlers of operations that can be applied to programs
using (9) to give specific semantics to operations.

The approach of algebraic effects has led to a significant body of research
on programming with effects and handlers, but it imposes an assumption on
the operations to be modelled: the construction of Σ* in Lemma 1 implies
that the multiplication μ of the monad Σ* satisfies the algebraicity property:
op · (Σ ∘ μ) = μ · (op ∘ Σ*) : ΣΣ*Σ* → Σ* where op : ΣΣ* → Σ*. This
intuitively means that every operation in Σ must commute with sequential
composition of computations. Many, but not all, effectful operations satisfy this
property, and they are called algebraic operations.


Adjoint Approach to Effects The crux of algebraic effects and handlers is the
adjunction Free_Σ ⊣ U_Σ. However, we have not relied on the adjunction being
the free/forgetful one at all: given any monad P : C → C that models the syntax
of effectful programs, if L ⊣ R : D → C is an adjunction such that RL = P as
monads, then objects D in D provide a means to interpret programs PA: for
any g : A → RD in C, we have the following interpretation morphism


handle_D g = R(ε_D · Lg) : PA = R(LA) → RD (10)


The intuition for g is that it transforms the returned value A of a computation 
into the carrier RD, so it corresponds to the ‘value case’ of effect handlers [8]. 
Piróg et al. call this approach the adjoint-theoretic approach to syntax and 
semantics of effects, and they construct an adjunction between indexed algebras 
and the base category for modelling scoped operations. Earlier, Levy and 
Kammar and Plotkin also adopt a similar adjunction-based viewpoint in the 
treatment of call-by-push-value calculi: value types are interpreted in the base 
category C, and computation types are interpreted in the algebra category D. 


Remark 1. A notable missing part of our treatment is the equations that specify 
operations in a signature. Following Kelly and Power [80], an equation for a 
signature $\Sigma : \mathbb{C} \to \mathbb{C}$ can be formulated as a pair of monad morphisms 
$\sigma, \tau : \Gamma^* \to \Sigma^*$ for some finitary functor $\Gamma$, and taking their coequaliser 
$\Gamma^* \rightrightarrows \Sigma^* \twoheadrightarrow M$ in the category of finitary monads constructs a monad $M$ that 
represents terms modulo the equation l = r. Although it seems straightforward to 
extend this formulation of equational theories to work with scoped effects, we do 
not consider equations in this paper for the sake of simplicity. 


Remark 2. Working with lfp categories precludes operations with infinite argu- 
ments, such as the get operation of mutable state when the state has infinite 
possible values, but this limitation is not inherent and can be handled by moving 
to locally $\kappa$-presentable categories [1] for some larger cardinal $\kappa$. 


3.2 Syntax of Scoped Operations 


Not all operations in programming languages can be adequately modelled as 
algebraic operations on Set, for example, $\lambda$-abstraction [16], memory cell 
generation [38, 48], more generally, effects with dynamically generated instances [62], 
explicit substitution [18], channel restriction in the $\pi$-calculus [61], and their syntax 
is usually modelled in some functor categories. More recently, Piróg et al. 
extend the work of Ghani and Uustalu [18] to model a family of non-algebraic 
operations, which they call scoped operations. In this subsection, we review their 
development in the setting of lfp categories. Throughout the rest of the paper, 
we fix an lfp category $\mathbb{C}$, and refer to it as the base category, and it is intended 
to be the category in which types of a programming language are interpreted. 
Furthermore, we fix two finitary endofunctors $\Sigma, \Gamma : \mathbb{C} \to \mathbb{C}$ and call them the 
algebraic signature and scoped signature respectively. 


Syntax Endofunctor P. Now our goal is to construct a monad $P : \mathbb{C} \to \mathbb{C}$ that 
models the syntax of programs with algebraic operations in $\Sigma$ and non-algebraic 
scoped operations in $\Gamma$. First we construct its underlying endofunctor. When $\mathbb{C}$ 
is Set, the intuition for programs $PA$ is that they are terms inductively built 
from the following inference rules: 


$$\frac{a \in A}{\mathsf{var}(a) \in PA} \qquad \frac{o \in \Sigma n \quad k : n \to PA}{o(k) \in PA} \qquad \frac{s \in \Gamma n \quad p : n \to PX \quad k : X \to PA}{\{s(p); k\} \in PA}$$


where $n$ ranges over finite sets and $o \in \Sigma n$ represents an algebraic operation 
of $|n|$ arguments, and similarly $s \in \Gamma n$ is a scoped operation that creates $|n|$ 
scopes. The difference between algebraic and scoped operations is manifested by 
an additional explicit continuation $k$ in the third rule: unlike for algebraic 
operations, sequentially composing $s(p)$ with $k$ does not equal $s(p; k)$, 
so the continuation for scoped operations must be explicitly kept in the syntax. 
When $\mathbb{C}$ is any lfp category, these rules translate to the following recursive 
equation for the functor $P : \mathbb{C} \to \mathbb{C}$: 


$$PA \cong A + \Sigma(PA) + \int^{X : \mathbb{C}} \coprod_{\mathbb{C}(X, PA)} \Gamma(PX) \qquad (11)$$


where the existentially quantified $X$ in the third rule is translated to a coend 
$\int^{X : \mathbb{C}}$ in $\mathbb{C}$ [41]. Moreover, the coend in (11) is isomorphic to $\Gamma(P(PA))$ because 
by the coend formula of Kan extension, it exactly computes $\mathrm{Lan}_I(\Gamma P)(PA)$, i.e. 
the left Kan extension of $\Gamma P$ along the identity functor $I : \mathbb{C} \to \mathbb{C}$, and by 
definition $\mathrm{Lan}_I(\Gamma P) = \Gamma P$. Thus (11) is equivalent to 


$$PA \cong A + \Sigma(PA) + \Gamma(P(PA)) \qquad (12)$$


which is exactly the Prog $\Sigma$ $\Gamma$ datatype that we saw in the Haskell 
implementation (7). To obtain a solution to (12), we construct a (higher-order) endofunctor 
$G : \mathsf{Endo}_f(\mathbb{C}) \to \mathsf{Endo}_f(\mathbb{C})$ to represent the Grammar where $\mathsf{Endo}_f(\mathbb{C})$ is the 
category of finitary endofunctors on $\mathbb{C}$: 


$$G = \mathrm{Id} + \Sigma \circ - + \Gamma \circ - \circ - \qquad (13)$$


where $\mathrm{Id} : \mathbb{C} \to \mathbb{C}$ is the identity functor. Then Lemma 1 is applicable 
because $\mathsf{Endo}_f(\mathbb{C})$ has all small colimits since colimits in functor categories can be 
computed pointwise and $\mathbb{C}$ has all small colimits. Furthermore, $G$ preserves all 
filtered colimits, in particular colimits of $\omega$-chains, because the composition functor 
$- \circ = \ : \mathsf{Endo}_f(\mathbb{C}) \times \mathsf{Endo}_f(\mathbb{C}) \to \mathsf{Endo}_f(\mathbb{C})$ is finitary, which follows from direct 
verification. Since initial algebras are precisely free algebras generated by the 
initial object, by Lemma 1 there is an initial $G$-algebra 
$\langle P : \mathsf{Endo}_f(\mathbb{C}),\ \mathit{in} : GP \to P \rangle$ and $\mathit{in}$ is an isomorphism. Thus $P$ obtained in 
this way is indeed a solution to (12): the endofunctor 
modelling the syntax of programs with algebraic and scoped operations. 


Monadic Structure of P. Next we equip the endofunctor $P$ with a monad 
structure. This can be done in several ways, either by the general result about 
$\Sigma$-monoids in $\mathsf{Endo}_f(\mathbb{C})$, or by [43, Theorem 4.3], or by the following 
relatively straightforward argument in [46]: by the 'diagonal rule' of 
computing initial algebras by Backhouse et al. [4], $P = \mu G$ is isomorphic to 
$P' = \mu X.\ \mathrm{Id} + \Sigma \circ X + \Gamma \circ P \circ X$. Note that $P'$ is exactly $(\Sigma + \Gamma \circ P)^*$ as 
endofunctors by Lemma 1, thus 


$$P \cong (\Sigma + \Gamma \circ P)^* : \mathsf{Endo}_f(\mathbb{C}) \qquad (14)$$


Then we equip $P$ with the same monad structure as the ordinary free monad 
$(\Sigma + \Gamma \circ P)^*$. The Haskell implementation given earlier is exactly this monad structure. 
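
To make the monad structure concrete, here is a minimal Haskell sketch of a datatype with the shape of (12) and a bind that treats it as the free monad on $\Sigma + \Gamma \circ P$. It is a reconstruction under assumptions, not a copy of the paper's code: the constructor names and the helpers Choice, Once, once, orP and leaves are hypothetical.

```haskell
-- Programs: a variable, an algebraic operation, or a scoped operation whose
-- scope returns the rest of the program (the explicit continuation).
data Prog f g a
  = Return a
  | Call  (f (Prog f g a))
  | Enter (g (Prog f g (Prog f g a)))

instance (Functor f, Functor g) => Functor (Prog f g) where
  fmap h (Return a) = Return (h a)
  fmap h (Call t)   = Call (fmap (fmap h) t)
  fmap h (Enter s)  = Enter (fmap (fmap (fmap h)) s)

instance (Functor f, Functor g) => Applicative (Prog f g) where
  pure = Return
  pf <*> pa = pf >>= \h -> fmap h pa

-- Bind only extends the continuations; the scoped part is left untouched.
instance (Functor f, Functor g) => Monad (Prog f g) where
  Return a >>= k = k a
  Call t   >>= k = Call (fmap (>>= k) t)
  Enter s  >>= k = Enter (fmap (fmap (>>= k)) s)

-- Assumed example signatures: fail/or (algebraic) and once (scoped).
data Choice k = Fail | Or k k
newtype Once k = Once k

instance Functor Choice where
  fmap _ Fail     = Fail
  fmap h (Or l r) = Or (h l) (h r)

instance Functor Once where
  fmap h (Once k) = Once (h k)

once :: Prog Choice Once a -> Prog Choice Once a
once p = Enter (Once (fmap Return p))

orP :: Prog Choice Once a -> Prog Choice Once a -> Prog Choice Once a
orP l r = Call (Or l r)

-- Purely syntactic observer: list all variables, ignoring what once means.
leaves :: Prog Choice Once a -> [a]
leaves (Return a)       = [a]
leaves (Call Fail)      = []
leaves (Call (Or l r))  = leaves l ++ leaves r
leaves (Enter (Once p)) = concatMap leaves (leaves p)
```

For instance, binding `once (orP (Return 1) (Return 2))` with `\x -> orP (Return x) (Return (x + 10))` keeps the scope around the first choice and pushes the second choice into the continuation.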


3.3 Functorial Algebras of Scoped Operations 


To interpret the monad $P$ modelling the syntax of scoped operations, it 
is natural to expect that semantics is given by $G$-algebras on $\mathsf{Endo}_f(\mathbb{C})$ so that 
interpretation is then the catamorphisms from $\mu G$ to $G$-algebras. And following 
the adjoint-theoretic approach (10), we would like to have an adjunction 
between $G\text{-}\mathsf{Alg}$ and $\mathbb{C}$ such that the induced monad is isomorphic to $P$. However, there 
seems to be no natural way to construct such an adjunction unless we replace 
$G$-algebras with a slight extension of them, which we refer to as functorial algebras, 
as the notion for giving semantics to scoped operations. In the following, we first 
define functorial algebras formally (Definition 1) and then show the adjunction 
between the category of functorial algebras and the base category (Theorem 1), 
which allows us to interpret $P$ with functorial algebras. 

A functorial algebra is carried by an endofunctor $H : \mathbb{C} \to \mathbb{C}$ with additionally 
an object $X$ in $\mathbb{C}$. The endofunctor $H$ also comes with a morphism $\alpha^G : GH \to H$ 
in $\mathsf{Endo}_f(\mathbb{C})$, and the object $X$ is equipped with a morphism $\alpha^I : \Sigma X + \Gamma H X \to X$ 
in $\mathbb{C}$. The intuition is that given a program of type $PX \cong X + \Sigma(PX) + \Gamma(P(PX))$, 
the middle $P$ in $\Gamma P P$ corresponds to the part of a program enclosed 
by some scoped operations (i.e. the $p$ in $\{s(p); k\}$), and this part of the program 
is interpreted by $H$ with $\alpha^G$. After the enclosed part is interpreted, the 
outermost layer of the program is interpreted by $X$ with $\alpha^I$ in the same way as 
interpreting free monads of algebraic operations. More precisely, let 
$I : \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C} \to \mathbb{C}$ be a bi-functor such that 


$$I_H X = \Sigma X + \Gamma(HX) \qquad I_\sigma f = \Sigma f + \Gamma(\sigma_{X'} \cdot Hf) \qquad (15)$$


for all $H : \mathsf{Endo}_f(\mathbb{C})$ and $X : \mathbb{C}$ and all morphisms $\sigma : H \to H'$ and $f : X \to X'$. 
Then we define an endofunctor $\mathsf{Fn} : \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C} \to \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$ such that 


$$\mathsf{Fn}\langle H, X \rangle = \langle GH, I_H X \rangle \qquad (16)$$


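Definition 1. A functorial algebra is an object $\langle H, X \rangle$ in $\mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$ paired 
with a structure map $\mathsf{Fn}\langle H, X \rangle \to \langle H, X \rangle$, or equivalently it is a quadruple 


$$\langle H : \mathsf{Endo}_f(\mathbb{C}),\ X : \mathbb{C},\ \alpha^G : GH \to H,\ \alpha^I : \Sigma X + \Gamma(HX) \to X \rangle$$


where $GH = \mathrm{Id} + \Sigma \circ H + \Gamma \circ H \circ H$. Morphisms between two functorial algebras 
$\langle H_1, X_1, \alpha_1^G, \alpha_1^I \rangle$ and $\langle H_2, X_2, \alpha_2^G, \alpha_2^I \rangle$ are pairs $\langle \sigma : H_1 \to H_2,\ f : X_1 \to X_2 \rangle$ 
making the following diagrams commute: 


$$\begin{array}{ccc}
GH_1 & \xrightarrow{\ \alpha_1^G\ } & H_1 \\
{\scriptstyle G\sigma}\downarrow & & \downarrow{\scriptstyle \sigma} \\
GH_2 & \xrightarrow{\ \alpha_2^G\ } & H_2
\end{array}
\qquad
\begin{array}{ccc}
\Sigma X_1 + \Gamma(H_1 X_1) & \xrightarrow{\ \alpha_1^I\ } & X_1 \\
{\scriptstyle \Sigma f + \Gamma(\sigma \circ f)}\downarrow & & \downarrow{\scriptstyle f} \\
\Sigma X_2 + \Gamma(H_2 X_2) & \xrightarrow{\ \alpha_2^I\ } & X_2
\end{array}$$


Functorial algebras and their morphisms form a category $\mathsf{Fn}\text{-}\mathsf{Alg}$. 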


Example 8. We reformulate our programming example of nondeterministic choice 
with once shown in Example 4 in the formal definition. Let $\mathbb{C} = \mathrm{Set}$ in this example 
and $1 = \{\star\}$ be some singleton set. We define signature endofunctors 


$$\Sigma X = 1 + X \times X \qquad \Gamma X = X$$


so that $\Sigma$ represents nullary algebraic operation fail and binary algebraic 
operation or, and $\Gamma$ represents the unary scoped operation once that creates one scope. 
Let $\mathrm{List} : \mathrm{Set} \to \mathrm{Set}$ be the endofunctor mapping a set $X$ to the set of finite lists 
with elements from $X$. We define natural transformations $\alpha^\Sigma : \Sigma \circ \mathrm{List} \to \mathrm{List}$ 
and $\alpha^\Gamma : \Gamma \circ \mathrm{List} \circ \mathrm{List} \to \mathrm{List}$ by 


$$\alpha^\Sigma(\iota_1\, \star) = \mathit{nil}, \quad \alpha^\Sigma(\iota_2\, (x, y)) = x \mathbin{+\!\!+} y, \quad \alpha^\Gamma(\mathit{nil}) = \mathit{nil}, \quad \alpha^\Gamma(\mathit{cons}\ x\ xs) = x$$


where nil is the empty list; $+\!\!+$ is list concatenation; and cons x xs is the list 
with an element x in front of xs. Then for any set $X$, $\langle \mathrm{List}, \mathrm{List}\, X \rangle$ carries a 
functorial algebra with structure maps 


$$\alpha^G = [\eta, \alpha^\Sigma, \alpha^\Gamma] : G\,\mathrm{List} \to \mathrm{List} \qquad \alpha^I = [\alpha^\Sigma_X, \alpha^\Gamma_X] : I_{\mathrm{List}}(\mathrm{List}\, X) \to \mathrm{List}\, X \qquad (17)$$


where $\eta : \mathrm{Id} \to \mathrm{List}$ wraps any element into a singleton list. 


3 The first argument $H$ to $I$ is written as subscript so that we have a more compact 
notation $I_H^*$ when taking the free monad of $I_H : \mathbb{C} \to \mathbb{C}$ with the first argument fixed. 


The last example exhibits that one can define a functorial algebra carried 
by $\langle H, HX \rangle$ from a $G$-algebra on $H : \mathsf{Endo}_f(\mathbb{C})$ by simply choosing the object 
component to be $HX$ for an arbitrary $X : \mathbb{C}$. In other words, there is a faithful 
functor $G\text{-}\mathsf{Alg} \to \mathsf{Fn}\text{-}\mathsf{Alg}$, which results in functorial algebras that interpret the 
outermost layer of a program (the part not enclosed by any scoped operation) 
in the same way as the inner layers. But in general, the object component of 
functorial algebras offers the flexibility that the outermost layer can be 
interpreted differently from the inner layers, as in the following example. 


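Example 9. Continuing Example 8, if one is only interested in the final number 
of possible outcomes, then one can define a functorial algebra $\langle \mathrm{List}, \mathbb{N}, \alpha^G, \alpha^I \rangle$ 
where $\alpha^G$ is as in (17) and 


$$\alpha^I(\iota_1\,(\iota_1\,\star)) = 0, \quad \alpha^I(\iota_1\,(\iota_2\,(x, y))) = x + y, \quad \alpha^I(\iota_2\ \mathit{nil}) = 0, \quad \alpha^I(\iota_2\,(\mathit{cons}\ n\ ns)) = n$$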


3.4 Interpreting with Functorial Algebras 


In the rest of this section we show how functorial algebras can be used to 
interpret programs $PA$ with scoped operations. We first construct a simple 
adjunction $\uparrow \dashv \downarrow$ between the base category $\mathbb{C}$ and $\mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$, which is then 
composed with the free/forgetful adjunction $\mathsf{Free}_{\mathsf{Fn}} \dashv U_{\mathsf{Fn}}$ between $\mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$ 
and $\mathsf{Fn}\text{-}\mathsf{Alg}$ for the functor $\mathsf{Fn}$ (16). The resulting adjunction is proven to 
induce a monad $T$ isomorphic to $P$ (Theorem 1), and by the adjoint-theoretic 
approach to syntax and semantics (10), this adjunction provides a means to 
interpret scoped operations modelled with the monad $P$ (Theorem 2). 

First we define a functor $\uparrow : \mathbb{C} \to \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$ such that $\uparrow X = \langle 0, X \rangle$ where 
$0 : \mathsf{Endo}_f(\mathbb{C})$ is the initial endofunctor, i.e. the constant functor sending everything 
to the initial object in $\mathbb{C}$. The functor $\uparrow$ is left adjoint to the projection functor 
$\downarrow : \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C} \to \mathbb{C}$ of the second component. 

Then we would like to compose $\uparrow \dashv \downarrow$ with the free-forgetful adjunction 
$\mathsf{Free}_{\mathsf{Fn}} \dashv U_{\mathsf{Fn}}$ for the endofunctor $\mathsf{Fn}$ on $\mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$, and the latter 
adjunction indeed exists. 


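Lemma 2. The endofunctor $\mathsf{Fn}$ (16) on $\mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$ has free algebras, i.e. 
there is a functor $\mathsf{Free}_{\mathsf{Fn}} : \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C} \to \mathsf{Fn}\text{-}\mathsf{Alg}$ left adjoint to the forgetful 
functor $U_{\mathsf{Fn}} : \mathsf{Fn}\text{-}\mathsf{Alg} \to \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$. 

These two adjunctions are depicted in the following diagram: 


$$\mathsf{Fn}\text{-}\mathsf{Alg} \ \underset{U_{\mathsf{Fn}}}{\overset{\mathsf{Free}_{\mathsf{Fn}}}{\leftrightarrows}} \ \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C} \ \underset{\downarrow}{\overset{\uparrow}{\leftrightarrows}} \ \mathbb{C} \qquad (18)$$


and we compose them to obtain an adjunction $\mathsf{Free}_{\mathsf{Fn}} \uparrow \ \dashv\ \downarrow U_{\mathsf{Fn}}$ between $\mathsf{Fn}\text{-}\mathsf{Alg}$ 
and $\mathbb{C}$, giving rise to a monad $T = \ \downarrow U_{\mathsf{Fn}} \mathsf{Free}_{\mathsf{Fn}} \uparrow$. In the rest of this section, we 
prove that $T$ is isomorphic to $P$ (11) in the category of monads, which is crucial 
in this paper, since it allows us to interpret scoped operations modelled by the 
monad $P$ with functorial algebras $\mathsf{Fn}\text{-}\mathsf{Alg}$. 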

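We first establish a technical lemma characterising the free $\mathsf{Fn}$-algebra on the 
product category $\mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}$ in terms of the free algebras in $\mathbb{C}$ and $\mathsf{Endo}_f(\mathbb{C})$. 


Lemma 3. There is a natural isomorphism between $\mathsf{Free}_{\mathsf{Fn}}$ and the following: 

$$\mathsf{Free}_{\mathsf{Fn}}\langle H, X \rangle \cong \langle G^*H : \mathsf{Endo}_f(\mathbb{C}),\ (I_{G^*H})^* X : \mathbb{C},\ \mathit{op}^{G^*}_H,\ \mathit{op}^{I^*_{G^*H}}_X \rangle$$


where $\mathit{op}^{G^*}_H : G(G^*H) \to G^*H$ and $\mathit{op}^{I^*_{G^*H}}_X : I_{G^*H}((I_{G^*H})^* X) \to (I_{G^*H})^* X$ 
are the structure maps of the free $G$-algebra and $I_{G^*H}$-algebra respectively. 


Theorem 1. The monads $P$ and $T$ are isomorphic as monads. 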


Remark 3. In general, the right adjoint $\downarrow U_{\mathsf{Fn}}$ is not monadic since it does not 
reflect isomorphisms, which is a necessary condition for it to be monadic by Beck's 
monadicity theorem [41]. This entails that the category $\mathsf{Fn}\text{-}\mathsf{Alg}$ of functorial 
algebras is not equivalent to the category of Eilenberg-Moore algebras. Nonetheless, 
as we will see later in Section 4, functorial algebras and Eilenberg-Moore algebras 
have the same expressive power for interpreting scoped operations. 


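The isomorphism established enables us to interpret programs 
modelled by the monad $P$ using functorial algebras following (10): for any 
functorial algebra $\langle H, X, \alpha^G, \alpha^I \rangle$, and any morphism $g : A \to X$ in 
the base category $\mathbb{C}$, there is a morphism 


$$\mathsf{handle}_{\langle H, X, \alpha^G, \alpha^I \rangle}\, g = \ \downarrow U_{\mathsf{Fn}}(\epsilon_{\langle H, X, \alpha^G, \alpha^I \rangle} \cdot \mathsf{Free}_{\mathsf{Fn}} \uparrow g) : PA \cong TA \to X \qquad (19)$$


which interprets programs $PA$ with the functorial algebra $\langle H, X, \alpha^G, \alpha^I \rangle$. 
Furthermore, we can derive the following recursive formula for this 
interpretation morphism, which is exactly the Haskell implementation in Figure 1. 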


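Theorem 2 (Interpreting with Functorial Algebras). For any functorial 
algebra $\alpha = \langle H, X, \alpha^G, \alpha^I \rangle$ as in Definition 1 and any morphism $g : A \to X$ 
for some $A$ in the base category $\mathbb{C}$, let $h = (\!|\alpha^G|\!) : P \to H$ be the catamorphism 
from the initial $G$-algebra $P$ to the $G$-algebra $\alpha^G : GH \to H$. The interpretation 
of $PA$ with this algebra $\alpha$ and $g$ satisfies 


$$\mathsf{handle}_\alpha\, g = [\, g,\ \alpha^I_\Sigma \cdot \Sigma(\mathsf{handle}_\alpha\, g),\ \alpha^I_\Gamma \cdot \Gamma h_X \cdot \Gamma P(\mathsf{handle}_\alpha\, g) \,] \cdot \mathit{in}^\circ_A \qquad (20)$$


where $\mathit{in}^\circ : P \to \mathrm{Id} + \Sigma \circ P + \Gamma \circ P \circ P$ is the isomorphism between $P$ and $GP$; 
morphisms $\alpha^I_\Sigma = \alpha^I \cdot \iota_1 : \Sigma X \to X$ and $\alpha^I_\Gamma = \alpha^I \cdot \iota_2 : \Gamma H X \to X$ are the two 
components of $\alpha^I : \Sigma X + \Gamma H X \to X$. 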
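
The recursion in (20) can be sketched in Haskell as follows. The text names EndoAlg, BaseAlg and hcata as parts of the paper's Figure 1; everything else here (field names, the Choice and Once signatures, choose, firstOf, listEndo, listBase, prog) is an assumption made for this illustration, using the algebra of Example 8.

```haskell
{-# LANGUAGE RankNTypes #-}

data Prog f g a
  = Return a
  | Call  (f (Prog f g a))
  | Enter (g (Prog f g (Prog f g a)))

instance (Functor f, Functor g) => Functor (Prog f g) where
  fmap h (Return a) = Return (h a)
  fmap h (Call t)   = Call (fmap (fmap h) t)
  fmap h (Enter s)  = Enter (fmap (fmap (fmap h)) s)

-- The three components of a G-algebra on an endofunctor h.
data EndoAlg f g h = EndoAlg
  { returnE :: forall x. x -> h x
  , callE   :: forall x. f (h x) -> h x
  , enterE  :: forall x. g (h (h x)) -> h x
  }

-- The two components of the base-level algebra on an object x.
data BaseAlg f g h x = BaseAlg
  { callB  :: f x -> x
  , enterB :: g (h x) -> x
  }

-- Catamorphism from the syntax functor to h: interprets the inner layers.
hcata :: (Functor f, Functor g) => EndoAlg f g h -> Prog f g a -> h a
hcata alg (Return x) = returnE alg x
hcata alg (Call t)   = callE alg (fmap (hcata alg) t)
hcata alg (Enter s)  = enterE alg (fmap (hcata alg . fmap (hcata alg)) s)

-- The three cases mirror the three components of the cotuple in (20).
handle :: (Functor f, Functor g)
       => EndoAlg f g h -> BaseAlg f g h x -> (a -> x) -> Prog f g a -> x
handle _    _    gen (Return a) = gen a
handle ealg balg gen (Call t)   = callB balg (fmap (handle ealg balg gen) t)
handle ealg balg gen (Enter s)  =
  enterB balg (fmap (hcata ealg . fmap (handle ealg balg gen)) s)

-- Example 8: nondeterminism (fail, or) with the scoped operation once.
data Choice k = Fail | Or k k
newtype Once k = Once k

instance Functor Choice where
  fmap _ Fail     = Fail
  fmap h (Or l r) = Or (h l) (h r)

instance Functor Once where
  fmap h (Once k) = Once (h k)

choose :: Choice [a] -> [a]
choose Fail     = []
choose (Or l r) = l ++ r

firstOf :: Once [[a]] -> [a]
firstOf (Once [])       = []
firstOf (Once (xs : _)) = xs

listEndo :: EndoAlg Choice Once []
listEndo = EndoAlg { returnE = \x -> [x], callE = choose, enterE = firstOf }

listBase :: BaseAlg Choice Once [] [a]
listBase = BaseAlg { callB = choose, enterB = firstOf }

-- once (or 1 2), continued by \x -> or x (x + 10).
prog :: Prog Choice Once Int
prog = Enter (Once (Call (Or (Return (k 1)) (Return (k 2)))))
  where k x = Call (Or (Return x) (Return (x + 10)))
```

Here `handle listEndo listBase (\x -> [x]) prog` evaluates to `[1,11]`: the scope keeps only the first alternative, and the continuation is then run on it.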

To summarise, we have defined a notion of functorial algebras that we use to 
handle scoped operations. The heart of the development is the adjunction (18) 
that induces a monad isomorphic to the monad $P$ that models the syntax of 
programs with scoped operations, following which we derive a recursive formula 
that interprets programs with functorial algebras. The formula is exactly the 
implementation in Figure 1: the datatype EndoAlg represents the $\alpha^G$ in (20); 
datatype BaseAlg corresponds to $\alpha^I$; function hcata implements $(\!|\alpha^G|\!)$. 


4 Comparing the Models of Scoped Operations 


Functorial algebras are not the only option for interpreting scoped operations. In 
this section we compare functorial algebras with two other approaches, one being 
the indexed algebras of Piróg et al. [46] and the other one being Eilenberg-Moore (EM) 
algebras of the monad $P$ (12), which simulate scoped operations with algebraic 
operations. After a brief description of these two kinds of algebras, we compare 
them and show that their expressive power is in fact equivalent. 


4.1 Interpreting Scoped Operations with Eilenberg-Moore Algebras 


In standard algebraic effects, handlers are just $\Sigma$-algebras for some signature 
functor $\Sigma : \mathbb{C} \to \mathbb{C}$, and it is well known that the category $\Sigma\text{-}\mathsf{Alg}$ of $\Sigma$-algebras 
is equivalent to the category $\mathbb{C}^{\Sigma^*}$ of EM algebras of the monad $\Sigma^*$. Thus handlers 
of algebraic operations are exactly EM algebras of the monad $\Sigma^*$ modelling the 
syntax of algebraic operations. This observation suggests that we may also use 
EM algebras of the monad $P$ as the notion of handlers for scoped operations. 


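Lemma 4. EM algebras of $P$ are equivalent to $(\Sigma + \Gamma \circ P)$-algebras. In other 
words, an EM algebra of $P$ is equivalently a tuple 


$$\langle X : \mathbb{C},\ \alpha_\Sigma : \Sigma X \to X,\ \alpha_\Gamma : \Gamma(PX) \to X \rangle \qquad (21)$$


Thus we obtain a way of interpreting scoped operations based on the adjunction 
$\mathsf{Free}_{\Sigma + \Gamma \circ P} \dashv U_{\Sigma + \Gamma \circ P}$: given an EM algebra $\alpha = \langle X, \alpha_\Sigma, \alpha_\Gamma \rangle$ of $P$ as in 
(21), then for any $A : \mathbb{C}$ and morphism $g : A \to X$, the interpretation of $PA$ by 
$g$ and this EM algebra is 


$$\mathsf{handle}_\alpha\, g = U_{\Sigma + \Gamma \circ P}(\epsilon_\alpha \cdot \mathsf{Free}_{\Sigma + \Gamma \circ P}\, g) : PA \cong (\Sigma + \Gamma \circ P)^* A \to X \qquad (22)$$

The formula can also be turned into a recursive form 

$$\mathsf{handle}_\alpha\, g = [\, g,\ \alpha_\Sigma \cdot \Sigma(\mathsf{handle}_\alpha\, g),\ \alpha_\Gamma \cdot \Gamma P(\mathsf{handle}_\alpha\, g) \,] \cdot \mathit{in}^\circ_A \qquad (23)$$


that suits implementation (see the appendices for more details). 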
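
A minimal Haskell sketch of the recursive form (23) is given below. It is not the appendix code; the names EMAlg, callEM, enterEM, handleEM, onceEM and the Choice and Once signatures are hypothetical. Note that the scoped component receives an uninterpreted program and, in this example, handles it by calling the handler recursively.

```haskell
data Prog f g a
  = Return a
  | Call  (f (Prog f g a))
  | Enter (g (Prog f g (Prog f g a)))

instance (Functor f, Functor g) => Functor (Prog f g) where
  fmap h (Return a) = Return (h a)
  fmap h (Call t)   = Call (fmap (fmap h) t)
  fmap h (Enter s)  = Enter (fmap (fmap (fmap h)) s)

-- An EM algebra as in (21): one map for algebraic operations and one map
-- for scoped operations, whose scope is still an uninterpreted program.
data EMAlg f g x = EMAlg
  { callEM  :: f x -> x
  , enterEM :: g (Prog f g x) -> x
  }

-- The three cases mirror the three components of the cotuple in (23).
handleEM :: (Functor f, Functor g) => EMAlg f g x -> (a -> x) -> Prog f g a -> x
handleEM _   gen (Return a) = gen a
handleEM alg gen (Call t)   = callEM alg (fmap (handleEM alg gen) t)
handleEM alg gen (Enter s)  = enterEM alg (fmap (fmap (handleEM alg gen)) s)

-- Assumed example: nondeterminism with once.
data Choice k = Fail | Or k k
newtype Once k = Once k

instance Functor Choice where
  fmap _ Fail     = Fail
  fmap h (Or l r) = Or (h l) (h r)

instance Functor Once where
  fmap h (Once k) = Once (h k)

choose :: Choice [a] -> [a]
choose Fail     = []
choose (Or l r) = l ++ r

-- The scope is interpreted by a recursive call at the carrier of lists of
-- lists, after which only the first outcome is kept.
onceEM :: EMAlg Choice Once [a]
onceEM = EMAlg
  { callEM  = choose
  , enterEM = \(Once p) -> case handleEM onceEM (\xs -> [xs]) p of
      []       -> []
      (xs : _) -> xs
  }

-- once (or 1 2), continued by \x -> or x (x + 10).
prog :: Prog Choice Once Int
prog = Enter (Once (Call (Or (Return (k 1)) (Return (k 2)))))
  where k x = Call (Or (Return x) (Return (x + 10)))
```

With these definitions, `handleEM onceEM (\x -> [x]) prog` evaluates to `[1,11]`.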

Interpreting scoped operations with EM algebras can be understood as 
simulating scoped operations with algebraic operations and general recursion: a 
signature $(\Sigma, \Gamma)$ of algebraic-and-scoped operations is simulated by a signature 
$(\Sigma + \Gamma \circ P)$ of algebraic operations where $P$ is recursively given by $(\Sigma + \Gamma \circ P)^*$. In 
this way, one can simulate scoped operations in languages implementing algebraic 
effects that allow signatures of operations to be recursive, such as [7, 19, 36], but 
not the original design by Plotkin and Pretnar [52], which requires signatures of 
operations to mention only some base types. 

The downside of this simulating approach is that the denotational seman- 
tics of the language becomes more complex and usually involves solving some 
domain-theoretic recursive equations, like in [7]. Moreover, this approach typi- 
cally requires handlers to be defined with general recursion, which obscures the 
inherent structure of scoped operations, making reasoning about handlers of 
scoped operations more difficult. 


4.2 Indexed Algebras of Scoped Effects 


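Indexed algebras of scoped operations by Piróg et al. are yet another way 
of interpreting scoped operations. They are based on the following adjunction: 


$$\mathsf{Ix}\text{-}\mathsf{Alg} \ \underset{U_{\mathsf{Ix}}}{\overset{\mathsf{Free}_{\mathsf{Ix}}}{\leftrightarrows}} \ \mathbb{C}^{|\mathbb{N}|} \ \underset{\downharpoonleft}{\overset{\upharpoonleft}{\leftrightarrows}} \ \mathbb{C} \qquad (24)$$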


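where $\mathbb{C}^{|\mathbb{N}|}$ is the functor category from the discrete category $|\mathbb{N}|$ of natural 
numbers to the base category $\mathbb{C}$. That is to say, an object in $\mathbb{C}^{|\mathbb{N}|}$ is a family of 
objects $A_i$ in $\mathbb{C}$ indexed by natural numbers $i \in |\mathbb{N}|$, and a morphism $\tau : A \to B$ 
in $\mathbb{C}^{|\mathbb{N}|}$ is a family of morphisms $\tau_i : A_i \to B_i$ in $\mathbb{C}$ (with no coherence conditions). 
An endofunctor $\mathsf{Ix} : \mathbb{C}^{|\mathbb{N}|} \to \mathbb{C}^{|\mathbb{N}|}$ is defined to characterise indexed algebras: 


$$\mathsf{Ix}\, A = \Sigma \circ A + \Gamma \circ (\triangleleft A) + (\triangleright A)$$


where $\triangleleft$ and $\triangleright$ are functors $\mathbb{C}^{|\mathbb{N}|} \to \mathbb{C}^{|\mathbb{N}|}$ shifting indices such that $(\triangleleft A)_i = A_{i+1}$ 
and $(\triangleright A)_0 = 0$ and $(\triangleright A)_{i+1} = A_i$. Then objects in $\mathsf{Ix}\text{-}\mathsf{Alg}$ are called indexed 
algebras. Furthermore, since a morphism $(\triangleright A) \to A$ is in bijection with 
$A \to (\triangleleft A)$, an indexed algebra can be given by the following tuple: 


$$\langle A : \mathbb{C}^{|\mathbb{N}|},\ a : \Sigma \circ A \to A,\ d : \Gamma \circ (\triangleleft A) \to A,\ p : A \to (\triangleleft A) \rangle \qquad (25)$$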


The operational intuition for it is that the carrier $A_i$ at level $i$ interprets the 
part of syntax enclosed by $i$ layers of scopes, and when interpreting a scoped 
operation $\Gamma(P(PX))$ at layer $i$, the part of syntax outside the scope is first 
interpreted, resulting in $\Gamma(PA_i)$, and then the indexed algebra provides a way 
$p$ to promote the carrier to the next level, resulting in $\Gamma(PA_{i+1})$. After the 
inner layer is also interpreted as $\Gamma A_{i+1}$, the indexed algebra provides a way $d$ to 
demote the carrier, producing $A_i$ again. Additionally the morphism $a$ interprets 
ordinary algebraic operations. 


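Example 10. Example 8 for nondeterministic choice with once can be expressed 
with an indexed algebra as follows. For any set $X$, we define an indexed object 
$A : \mathbb{C}^{|\mathbb{N}|}$ by $A_0 = \mathrm{List}\, X$ and $A_{i+1} = \mathrm{List}\, A_i$. The object $A$ carries an indexed 
algebra with the following structure maps: for all $i \in \mathbb{N}$, 


$$a_i(\iota_1\, \star) = \mathit{nil}, \quad a_i(\iota_2\, (x, y)) = x \mathbin{+\!\!+} y, \quad d_i(\mathit{nil}) = \mathit{nil}, \quad d_i(\mathit{cons}\ x\ xs) = x, \quad p_i(x) = \mathit{cons}\ x\ \mathit{nil}$$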

The adjunction $\mathsf{Free}_{\mathsf{Ix}} \dashv U_{\mathsf{Ix}}$ in (24) is the free-forgetful adjunction for $\mathsf{Ix}$ 
on $\mathbb{C}^{|\mathbb{N}|}$. The other adjunction $\upharpoonleft \dashv \downharpoonleft$ is given by $\downharpoonleft A = A_0$, $(\upharpoonleft X)_0 = X$, and 
$(\upharpoonleft X)_{i+1} = 0$ for all $i \in \mathbb{N}$. Importantly, Piróg et al. show that the monad 
induced by the adjunction (24) is isomorphic to monad $P$ (12), thus indexed 
algebras can also be used to interpret scoped operations 


$$\mathsf{handle}_{\langle A, a, d, p \rangle}\, g = \ \downharpoonleft U_{\mathsf{Ix}}(\epsilon_{\langle A, a, d, p \rangle} \cdot \mathsf{Free}_{\mathsf{Ix}} \upharpoonleft g) \qquad (26)$$


in the same way as what we do for functorial algebras in Section 3.4. Interpreting 
with indexed algebras can also be implemented in Haskell with GHC's DataKinds 
extension for type-level natural numbers (which can be found in the appendices). 


4.3 Comparison of Resolutions 


Now we come back to the real subject of this section: comparing the expressivity 
of the three ways of interpreting scoped operations. Specifically, we construct 
comparison functors between the respective categories of the three kinds of 
algebras, which translate one kind of algebra to another in a way preserving the 
induced interpretation in the base category. Categorically, the three kinds of 
algebras correspond to three resolutions of the monad $P$, which form a 
category of resolutions with comparison functors as morphisms (Definition 2). In 
this category, the Eilenberg-Moore resolution is the terminal object, and thus 
it automatically gives us comparison functors translating other kinds of 
algebras to EM algebras. To complete the circle of translations, we then construct 
comparison functors $K^{\mathsf{EM}}_{\mathsf{Fn}} : \mathbb{C}^P \to \mathsf{Fn}\text{-}\mathsf{Alg}$ translating EM algebras to functorial 
ones (Section 4.4) and $K^{\mathsf{Fn}}_{\mathsf{Ix}} : \mathsf{Fn}\text{-}\mathsf{Alg} \to \mathsf{Ix}\text{-}\mathsf{Alg}$ translating functorial algebras to 
indexed ones (Section 4.5). 


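Definition 2 (Resolutions [35]). Given a monad $M$ on $\mathbb{C}$, the category 
$\mathrm{Res}(M)$ of resolutions of $M$ has as objects adjunctions $\langle \mathbb{D},\ L \dashv R : \mathbb{D} \to \mathbb{C},\ \eta,\ \epsilon \rangle$ 
whose induced monad $RL$ is $M$. A morphism from a resolution $\langle \mathbb{D}, L \dashv R, \eta, \epsilon \rangle$ 
to $\langle \mathbb{D}', L' \dashv R', \eta', \epsilon' \rangle$ is a functor $K : \mathbb{D} \to \mathbb{D}'$, called a comparison functor, such 
that it commutes with the left and right adjoints, i.e. $KL = L'$ and $R'K = R$. 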


We have seen adjunctions for indexed algebras, EM algebras and functorial 
algebras respectively, each inducing the monad $P$ up to isomorphism, so each of 
them can be identified with an object in the category $\mathrm{Res}(P)$. For each resolution 
$\langle \mathbb{D}, L, R, \eta, \epsilon \rangle$, we have been using the objects $D$ in $\mathbb{D}$ to interpret scoped 
operations modelled by $P$: for any morphism $g : A \to RD$ in $\mathbb{C}$, the interpretation of 
$PA$ by $D$ and $g$ is $\mathsf{handle}_D\, g = R(\epsilon_D \cdot Lg) : PA \cong RLA \to RD$. Crucially, we 
show that interpretations are preserved by comparison functors. 


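Lemma 5 (Preservation of Interpretation). Let $K : \mathbb{D} \to \mathbb{D}'$ be any 
comparison functor between resolutions $\langle \mathbb{D}, L, R, \eta, \epsilon \rangle$ and $\langle \mathbb{D}', L', R', \eta', \epsilon' \rangle$ of some 
monad $M : \mathbb{C} \to \mathbb{C}$. For any object $D$ in $\mathbb{D}$ and any $g : A \to RD$ in $\mathbb{C}$, 


$$\mathsf{handle}_D\, g = \mathsf{handle}_{KD}\, g : MA \to RD\ (= R'KD) \qquad (27)$$


where each side interprets $MA$ using $L \dashv R$ and $L' \dashv R'$ respectively. 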

This lemma implies that if there is a comparison functor $K$ from some 
resolution $L \dashv R : \mathbb{D} \to \mathbb{C}$ to $L' \dashv R' : \mathbb{D}' \to \mathbb{C}$ of the monad $P$, then $K$ can translate 
a $\mathbb{D}$ object to a $\mathbb{D}'$ object that preserves the induced interpretation. Thus the 
expressive power of $\mathbb{D}$ for interpreting $P$ is not greater than that of $\mathbb{D}'$, in the sense 
that every $\mathsf{handle}_D\, g$ that one can obtain from $D$ in $\mathbb{D}$ can also be obtained by 
an algebra $KD$ in $\mathbb{D}'$. Thus the three kinds of algebras for interpreting scoped 
operations have the same expressivity if we can construct a circle of comparison 
functors between their categories, which is what we do in the following. 


Translating to EM Algebras. As shown in [41], an important property of the 
Eilenberg-Moore adjunction is that it is the terminal object in the category 
$\mathrm{Res}(M)$ for any monad $M$, which means that there uniquely exists a comparison 
functor from every resolution to the Eilenberg-Moore resolution. Specifically, 
given a resolution $\langle \mathbb{D}, L, R, \eta, \epsilon \rangle$ of a monad $M$, the unique comparison functor 
$K$ from $\mathbb{D}$ to the category $\mathbb{C}^M$ of the Eilenberg-Moore algebras is 


$$KD = \big( M(RD) = RLRD \xrightarrow{\ R\epsilon_D\ } RD \big) \qquad\text{and}\qquad K(D \xrightarrow{\ f\ } D') = Rf$$


Lemma 6. There uniquely exist comparison functors $K^{\mathsf{Ix}}_{\mathsf{EM}} : \mathsf{Ix}\text{-}\mathsf{Alg} \to \mathbb{C}^P$ and 
$K^{\mathsf{Fn}}_{\mathsf{EM}} : \mathsf{Fn}\text{-}\mathsf{Alg} \to \mathbb{C}^P$ from the resolutions of indexed algebras and functorial 
algebras to the resolution of EM algebras. 


4.4 Translating EM Algebras to Functorial Algebras 


Now we construct a comparison functor $K^{\mathsf{EM}}_{\mathsf{Fn}} : \mathbb{C}^P \to \mathsf{Fn}\text{-}\mathsf{Alg}$ translating EM 
algebras to functorial ones. The idea is straightforward: given an EM algebra $X$, 
we map it to the functorial algebra with $X$ for interpreting the outermost layer 
and the functor $P$ for interpreting the inner layers, which essentially leaves the 
inner layers uninterpreted before they get to the outermost layer. 

Since $\mathbb{C}^P$ is isomorphic to $(\Sigma + \Gamma \circ P)\text{-}\mathsf{Alg}$, we can define $K^{\mathsf{EM}}_{\mathsf{Fn}}$ on 
$(\Sigma + \Gamma \circ P)$-algebras instead. Given any $\langle X : \mathbb{C},\ \alpha : (\Sigma + \Gamma \circ P)X \to X \rangle$, it is mapped by 
$K^{\mathsf{EM}}_{\mathsf{Fn}}$ to the functorial algebra 


$$\langle P,\ X,\ \mathit{in} : GP \to P,\ \alpha : (\Sigma + \Gamma \circ P)X \to X \rangle$$


and any morphism $f$ in $(\Sigma + \Gamma \circ P)\text{-}\mathsf{Alg}$ is mapped to $\langle \mathit{id}_P, f \rangle$. To show 
that $K^{\mathsf{EM}}_{\mathsf{Fn}}$ is a comparison functor, we only need to show that it commutes with the left 
and right adjoints of both resolutions. Details can be found in the appendices. 


Lemma 7. Functor $K^{\mathsf{EM}}_{\mathsf{Fn}}$ is a comparison functor from the Eilenberg-Moore 
resolution of $P$ to the resolution $\mathsf{Free}_{\mathsf{Fn}} \uparrow \ \dashv\ \downarrow U_{\mathsf{Fn}}$ of functorial algebras. 
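
The translation can be sketched in Haskell as below: the endofunctor component is the syntax functor itself with its constructors as the structure map, and the base component reuses the EM algebra unchanged. This is an illustrative reconstruction; emToFn, handleViaFn, the record field names and the example signatures are all hypothetical names introduced here.

```haskell
{-# LANGUAGE RankNTypes #-}

data Prog f g a
  = Return a
  | Call  (f (Prog f g a))
  | Enter (g (Prog f g (Prog f g a)))

instance (Functor f, Functor g) => Functor (Prog f g) where
  fmap h (Return a) = Return (h a)
  fmap h (Call t)   = Call (fmap (fmap h) t)
  fmap h (Enter s)  = Enter (fmap (fmap (fmap h)) s)

data EndoAlg f g h = EndoAlg
  { returnE :: forall x. x -> h x
  , callE   :: forall x. f (h x) -> h x
  , enterE  :: forall x. g (h (h x)) -> h x
  }

data BaseAlg f g h x = BaseAlg { callB :: f x -> x, enterB :: g (h x) -> x }

data EMAlg f g x = EMAlg { callEM :: f x -> x, enterEM :: g (Prog f g x) -> x }

hcata :: (Functor f, Functor g) => EndoAlg f g h -> Prog f g a -> h a
hcata alg (Return x) = returnE alg x
hcata alg (Call t)   = callE alg (fmap (hcata alg) t)
hcata alg (Enter s)  = enterE alg (fmap (hcata alg . fmap (hcata alg)) s)

-- Interpretation with a functorial algebra.
handle :: (Functor f, Functor g)
       => EndoAlg f g h -> BaseAlg f g h x -> (a -> x) -> Prog f g a -> x
handle _    _    gen (Return a) = gen a
handle ealg balg gen (Call t)   = callB balg (fmap (handle ealg balg gen) t)
handle ealg balg gen (Enter s)  =
  enterB balg (fmap (hcata ealg . fmap (handle ealg balg gen)) s)

-- Interpretation with an EM algebra.
handleEM :: (Functor f, Functor g) => EMAlg f g x -> (a -> x) -> Prog f g a -> x
handleEM _   gen (Return a) = gen a
handleEM alg gen (Call t)   = callEM alg (fmap (handleEM alg gen) t)
handleEM alg gen (Enter s)  = enterEM alg (fmap (fmap (handleEM alg gen)) s)

-- The translation: inner layers are rebuilt as syntax, the outer layer
-- is handled by the EM algebra.
emToFn :: EMAlg f g x -> (EndoAlg f g (Prog f g), BaseAlg f g (Prog f g) x)
emToFn alg =
  ( EndoAlg { returnE = Return, callE = Call, enterE = Enter }
  , BaseAlg { callB = callEM alg, enterB = enterEM alg } )

handleViaFn :: (Functor f, Functor g) => EMAlg f g x -> (a -> x) -> Prog f g a -> x
handleViaFn alg gen p = case emToFn alg of
  (ealg, balg) -> handle ealg balg gen p

-- Assumed example: nondeterminism with once, handled by general recursion.
data Choice k = Fail | Or k k
newtype Once k = Once k

instance Functor Choice where
  fmap _ Fail     = Fail
  fmap h (Or l r) = Or (h l) (h r)

instance Functor Once where
  fmap h (Once k) = Once (h k)

choose :: Choice [a] -> [a]
choose Fail     = []
choose (Or l r) = l ++ r

onceEM :: EMAlg Choice Once [a]
onceEM = EMAlg
  { callEM  = choose
  , enterEM = \(Once p) -> case handleEM onceEM (\xs -> [xs]) p of
      []       -> []
      (xs : _) -> xs
  }

prog :: Prog Choice Once Int
prog = Enter (Once (Call (Or (Return (k 1)) (Return (k 2)))))
  where k x = Call (Or (Return x) (Return (x + 10)))
```

On this sample program the translated algebra gives the same result as the EM algebra, `[1,11]`, which is one instance of the preservation property stated in Lemma 5.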


4.5 Translating Functorial Algebras to Indexed Algebras 


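At this point we have comparison functors 
$\mathsf{Ix}\text{-}\mathsf{Alg} \xrightarrow{K^{\mathsf{Ix}}_{\mathsf{EM}}} \mathbb{C}^P \xrightarrow{K^{\mathsf{EM}}_{\mathsf{Fn}}} \mathsf{Fn}\text{-}\mathsf{Alg}$. To 
complete the circle of translations, we construct a comparison functor 
$K^{\mathsf{Fn}}_{\mathsf{Ix}} : \mathsf{Fn}\text{-}\mathsf{Alg} \to \mathsf{Ix}\text{-}\mathsf{Alg}$ in this subsection. The idea of this translation is that given 
a functorial algebra carried by endofunctor $H : \mathsf{Endo}_f(\mathbb{C})$ and object $X : \mathbb{C}$, we map 
it to an indexed algebra by iterating the endofunctor $H$ on $X$. More precisely, 
$K^{\mathsf{Fn}}_{\mathsf{Ix}}$ maps a functorial algebra 


$$\langle H : \mathsf{Endo}_f(\mathbb{C}),\ X : \mathbb{C},\ \alpha^G : \mathrm{Id} + \Sigma \circ H + \Gamma \circ H \circ H \to H,\ \alpha^I : \Sigma X + \Gamma H X \to X \rangle$$


to an indexed algebra carried by $A : \mathbb{C}^{|\mathbb{N}|}$ such that $A_i = H^i X$, i.e. iterating 
$H$ $i$ times on $X$. The structure maps of this indexed algebra 
$\langle a : \Sigma A \to A,\ d : \Gamma(\triangleleft A) \to A,\ p : A \to (\triangleleft A) \rangle$ are given by 

$$\begin{array}{ll}
a_0 = \alpha^I \cdot \iota_1 : \Sigma X \to X & a_{i+1} = \alpha^G_{H^i X} \cdot \iota_2 : \Sigma H H^i X \to H H^i X \\
d_0 = \alpha^I \cdot \iota_2 : \Gamma H X \to X & d_{i+1} = \alpha^G_{H^i X} \cdot \iota_3 : \Gamma H H H^i X \to H H^i X
\end{array}$$


and $p_i = \alpha^G_{H^i X} \cdot \iota_1 : H^i X \to H H^i X$. On morphisms, $K^{\mathsf{Fn}}_{\mathsf{Ix}}$ maps a morphism 
$\langle \tau : H \to H',\ f : X \to X' \rangle$ in $\mathsf{Fn}\text{-}\mathsf{Alg}$ to $\sigma : H^i X \to {H'}^i X'$ in $\mathsf{Ix}\text{-}\mathsf{Alg}$ such that 
$\sigma_0 = f$ and $\sigma_{i+1} = \tau \circ \sigma_i$ where $\circ$ is horizontal composition. 


Lemma 8. $K^{\mathsf{Fn}}_{\mathsf{Ix}}$ is a comparison functor from the resolution $\mathsf{Free}_{\mathsf{Fn}} \uparrow \ \dashv\ \downarrow U_{\mathsf{Fn}}$ of 
functorial algebras to the resolution $\mathsf{Free}_{\mathsf{Ix}} \upharpoonleft \ \dashv\ \downharpoonleft U_{\mathsf{Ix}}$ of indexed algebras. 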


Since comparison functors preserve interpretation (Lemma 5), the lemma 
above implies that the expressivity of functorial algebras is not greater than 
that of indexed ones. Together with the comparison functors defined earlier, we 
conclude that the three kinds of algebras (indexed, functorial and Eilenberg-Moore 
algebras) have the same expressivity for interpreting scoped operations. 


Remark 4. Although the three kinds of algebras have the same expressivity in 
theory, they structure the interpretation of scoped operations in different ways: 
EM algebras impose no constraint on how the part of syntax enclosed by scopes is 
handled; indexed algebras demand that it be handled layer by layer but impose 
no coherence conditions between the layers; functorial algebras additionally force 
all inner layers to be handled in a uniform way by an endofunctor. 

On the whole, it is a trade-off between simplicity and structuredness: EM algebras 
are the simplest for implementation, whereas the structuredness of functorial 
algebras makes them easier to reason about. This is another instance of the 
preference for structured programming over unstructured language features, in 
the same way as structured loops being favoured over goto, although they have 
the same expressivity in theory [13]. 


5 Fusion Laws of Interpretation 


An advantage of the adjoint-theoretic approach to syntax and semantics is that 
the naturality of an adjunction directly offers fusion laws of interpretation that 
fuse a morphism after an interpretation into a single interpretation, which have 
proven to be a powerful tool for reasoning about and optimising programs 
manipulating abstract syntax and in particular handlers of algebraic 
effects [69, 73]. In this section, we present the fusion law for functorial algebras. 


5.1 Fusion Laws of Interpretation 


Recall that given any resolution $L \dashv R$ with counit $\epsilon$ of some monad $M : \mathbb{C} \to \mathbb{C}$ 
where $L : \mathbb{C} \to \mathbb{D}$, for any $g : A \to RD$, we have an interpretation morphism 


$$\mathsf{handle}_D\, g = R(\epsilon_D \cdot Lg) : MA \to RD$$


Then whenever we have a morphism in the form of $(f \cdot \mathsf{handle}_D\, g)$, that is, an 
interpretation followed by some morphism, the following fusion law allows one to fuse 
it into a single interpretation morphism. 


Lemma 9 (Interpretation Fusion). Assume $L \dashv R$ is a resolution of monad 
$M : \mathbb{C} \to \mathbb{C}$ where $L : \mathbb{C} \to \mathbb{D}$. For every $D : \mathbb{D}$, $g : A \to RD$ and $f : RD \to X$, 
if there is some $D'$ and $h : D \to D'$ in $\mathbb{D}$ such that $RD' = X$ and $Rh = f$, then 


$$f \cdot \mathsf{handle}_D\, g = \mathsf{handle}_{D'}\, (f \cdot g) \qquad (28)$$


Applying the lemma to the three resolutions of $P$ gives us three fusion laws: 
for any $D : \mathbb{D}$ where $\mathbb{D} \in \{\mathsf{Ix}\text{-}\mathsf{Alg},\ \mathsf{Fn}\text{-}\mathsf{Alg},\ \mathbb{C}^P\}$, one can fuse $f \cdot \mathsf{handle}_D\, g$ into 
a single interpretation if one can make $f$ a $\mathbb{D}$-homomorphism. In particular, the 
following is the fusion law for functorial algebras. 


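Corollary (Fusion Law for Functorial Algebras). Let $\hat\alpha_1 = \langle H_1, X_1, \alpha_1^G, \alpha_1^I \rangle$ 
be a functorial algebra and $g : A \to X_1$, $f : X_1 \to X_2$ be any 
morphisms in $\mathbb{C}$. If there is a functorial algebra $\hat\alpha_2 = \langle H_2, X_2, \alpha_2^G, \alpha_2^I \rangle$ and a 
functorial algebra morphism $\langle \sigma : H_1 \to H_2,\ f : X_1 \to X_2 \rangle$ from $\hat\alpha_1$ to $\hat\alpha_2$, then 


$$f \cdot \mathsf{handle}_{\hat\alpha_1}\, g = \mathsf{handle}_{\hat\alpha_2}\, (f \cdot g)$$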


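Example 11. Let $\hat\alpha$ be the functorial algebra of nondeterminism with once in 
Example 8 and $\mathit{len} : \mathrm{List}\, A \to \mathbb{N}$ be the function mapping a list to its length. 
Then using the fusion law, $\mathit{len} \cdot \mathsf{handle}_{\hat\alpha}\, g = \mathsf{handle}_{\hat\beta}\, (\mathit{len} \cdot g)$ if we can find a 
suitable functorial algebra $\hat\beta : \mathsf{Fn}\text{-}\mathsf{Alg}$ and $h : \hat\alpha \to \hat\beta$ s.t. $\downarrow U_{\mathsf{Fn}}\, h = \mathit{len}$. In fact, a 
suitable $\hat\beta$ is just the functorial algebra in Example 9 and $h = \langle \mathit{id}, \mathit{len} \rangle$. 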
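
The equation of Example 11 can be checked on a sample program with the following Haskell sketch. It assumes encodings of the algebras of Examples 8 and 9 (listEndo, listBase, countBase) and a sample program prog; these names, the record fields, and the helper functions lhs and rhs are hypothetical. Testing one program illustrates the law and does not prove it.

```haskell
{-# LANGUAGE RankNTypes #-}

data Prog f g a
  = Return a
  | Call  (f (Prog f g a))
  | Enter (g (Prog f g (Prog f g a)))

instance (Functor f, Functor g) => Functor (Prog f g) where
  fmap h (Return a) = Return (h a)
  fmap h (Call t)   = Call (fmap (fmap h) t)
  fmap h (Enter s)  = Enter (fmap (fmap (fmap h)) s)

data EndoAlg f g h = EndoAlg
  { returnE :: forall x. x -> h x
  , callE   :: forall x. f (h x) -> h x
  , enterE  :: forall x. g (h (h x)) -> h x
  }

data BaseAlg f g h x = BaseAlg { callB :: f x -> x, enterB :: g (h x) -> x }

hcata :: (Functor f, Functor g) => EndoAlg f g h -> Prog f g a -> h a
hcata alg (Return x) = returnE alg x
hcata alg (Call t)   = callE alg (fmap (hcata alg) t)
hcata alg (Enter s)  = enterE alg (fmap (hcata alg . fmap (hcata alg)) s)

handle :: (Functor f, Functor g)
       => EndoAlg f g h -> BaseAlg f g h x -> (a -> x) -> Prog f g a -> x
handle _    _    gen (Return a) = gen a
handle ealg balg gen (Call t)   = callB balg (fmap (handle ealg balg gen) t)
handle ealg balg gen (Enter s)  =
  enterB balg (fmap (hcata ealg . fmap (handle ealg balg gen)) s)

data Choice k = Fail | Or k k
newtype Once k = Once k

instance Functor Choice where
  fmap _ Fail     = Fail
  fmap h (Or l r) = Or (h l) (h r)

instance Functor Once where
  fmap h (Once k) = Once (h k)

choose :: Choice [a] -> [a]
choose Fail     = []
choose (Or l r) = l ++ r

firstOf :: Once [[a]] -> [a]
firstOf (Once [])       = []
firstOf (Once (xs : _)) = xs

-- Shared endofunctor part (lists) of both algebras.
listEndo :: EndoAlg Choice Once []
listEndo = EndoAlg { returnE = \x -> [x], callE = choose, enterE = firstOf }

-- Example 8: the base carrier is a list of outcomes.
listBase :: BaseAlg Choice Once [] [a]
listBase = BaseAlg { callB = choose, enterB = firstOf }

-- Example 9: the base carrier only counts outcomes.
countBase :: BaseAlg Choice Once [] Int
countBase = BaseAlg { callB = total, enterB = firstCount }
  where
    total Fail     = 0
    total (Or m n) = m + n
    firstCount (Once [])      = 0
    firstCount (Once (n : _)) = n

-- Left-hand and right-hand sides of the fused equation.
lhs, rhs :: Prog Choice Once a -> Int
lhs = length . handle listEndo listBase (\x -> [x])
rhs = handle listEndo countBase (length . (\x -> [x]))

prog :: Prog Choice Once Int
prog = Enter (Once (Call (Or (Return (k 1)) (Return (k 2)))))
  where k x = Call (Or (Return x) (Return (x + 10)))
```

Both `lhs prog` and `rhs prog` evaluate to `2`; the right-hand side never builds the list of outcomes at the outermost layer.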

Example 12. Although Piróg et al. propose the adjunction (24) to interpret 
scoped operations with indexed algebras, their Haskell implementation is not 
a faithful implementation of the interpretation morphism (26), but rather a 
more efficient one skipping the step of transforming $P$ to the isomorphic free 
indexed algebra ($\downharpoonleft U_{\mathsf{Ix}} \mathsf{Free}_{\mathsf{Ix}} \upharpoonleft$). However, it was previously unclear whether this 
implementation indeed coincides with the interpretation morphism (26) due to 
the discrepancy between the syntax monad $P$ and indexed algebras. 

This issue is in fact one of the original motivations for us to develop 
functorial algebras, a way to interpret $P$ that directly follows the syntactic structure. 
Using the comparison functors to transform between indexed and functorial 
algebras, we can reason about the implementation of Piróg et al. [46] with functorial 
algebras, and its correctness can be established using fusion laws. This extended 
case study is in the appendices. 


6 Related Work 


The most closely related work is that of Piróg et al. on categorical models of 
scoped effects. That work in turn builds on Wu et al. who introduced the 
notion of scoped effects after identifying modularity problems with using algebraic 
effect handlers for catching exceptions [52]. Scoped effects have found their way 
into several Haskell implementations of algebraic effects and handlers [32, 42, 56]. 


Effect Handlers and Modularity. Spivey [60], Moggi and Wadler initiated 
monads for modelling and programming with computational effects. Soon after, 
the desire arose to define complex monads by combining modular definitions 
of individual effects [26, 63], and monad transformers were developed to meet 
this need [89]. Yet, several years later, algebraic effects were proposed as an 
alternative, more structured approach for defining and combining computational 
effects [22, 48, 49]. The addition of handlers has made them practical for 
implementation and many languages and libraries have been developed since. 
Schrijvers et al. have characterised modular handlers by means of modular 
carriers, and shown that they correspond to a subclass of monad transformers. 

Scoped operations are generally not algebraic operations in the original design 
of algebraic effects [48], but as we have seen in Section 4.1, an alternative view 
on Eilenberg-Moore algebras of scoped operations is regarding them as handlers 
of algebraic operations of signature $\Sigma + \Gamma P$. However, the functor $\Sigma + \Gamma P$ 
involves the type $P$ modelling computations, and thus it is not a valid signature 
of algebraic effects in the original design of effect handlers [51, 52], in which the 
signature of algebraic effects can only be built from some base types to avoid 
the interdependence of the denotations of signature functors and computations. 
In spite of that, many later implementations of effect handlers such as EFF [7], 
KOKA and FRANK do not impose this restriction on signature functors 
(at the cost that the denotational semantics involves solving recursive 
domain-theoretic equations), and thus scoped operations can be implemented in these 
languages with EM algebras as handlers. 

Other variations of scoped effects have been suggested. Recently, Poulsen
et al. and van den Berg et al. have proposed a notion of staged or
latent effect, which is a variant of scoped effects, for modelling the deferred 
execution of computations inside lambda abstractions and similar constructs. 
Ahman and Pretnar investigate asynchronous effects, and they note that 
interrupt handlers are in fact scoped operations. We have not yet investigated 
this in our framework, but it will be an interesting use case. 


Abstract Syntax This work focusses on the problem of abstract syntax and
semantics of programs. The practical benefit of abstract syntax is that it allows
for generic programming in languages like Haskell that support, e.g., type
classes and GADTs. As an example, Swierstra showed that it is
possible to modularly create compilers by formalising syntax in Haskell.

Fiore et al. first formalise abstract syntax categorically for operations 
with variable binding. Subsequently, Ghani and Uustalu model the abstract 
syntax of explicit substitutions as an initial algebra in the endofunctor category
and show that it is a monad. Piróg et al. and this paper use a monad P, 
which is a slight generalisation of the monad of explicit substitutions, to model 
the syntax of scoped operations. The datatype underlying P is an instance of 
nested datatypes studied by Bird and Paterson and Johann and Ghani [24]. 

In this paper we have not treated equations on effectful operations, which 
are both theoretically and practically important. Plotkin and Power show 
that theories of various effects with suitable equations determine their corre- 
sponding monads, and later Hyland et al. show that certain combinations of 
effect theories are equivalent to monad transformers. Equations are also used for 
reasoning about programs with algebraic effects and handlers [84, 50, 73].
Possible ways to extend scoped effects with equations include the approach in
(Remark 1), the categorical framework of equational systems [4], second order
Lawvere theories [5], and syntactic frameworks like [62].


7 Conclusion 


The motivation of this work is to develop a structured approach to the syntax 
and semantics of scoped operations. We believe our proposal, functorial alge- 
bras, is at a sweet spot in the trade-off between structuredness and simplicity, 
allowing practical examples of scoped operations to be programmed and rea- 
soned about naturally, and implementable in modern functional languages such 
as Haskell and OCaml. We put our model and two other models for interpret- 
ing scoped effects in the same categorical framework, and we showed that they 
have equivalent expressivity for interpreting scoped effects, although they form 
non-equivalent categories. The uniform theoretical framework also induces fusion 
laws of interpretation in a straightforward way. 

There are two strains of work that should be pursued from here. The first 
one would be investigating ways to compose algebras of scoped operations. The 
second one would be the design of a language supporting handlers of scoped 
operations natively and its type system and operational semantics. 
Abstract. Regions are a useful tool for the safe and automatic management
of resources. Due to their scarcity, resources are often limited in
their lifetime which is associated with a certain scope. When control flow 
leaves the scope, the resources are released. Exceptions can non-locally 
exit such scopes and it is important that resources are also released in 
this case. 

Continuation-passing style is a useful compiler intermediate language 
that makes control flow explicit. All calls are tail calls and the runtime 
stack is not used. It can also serve as an implementation technique for 
control effects like exceptions. In this case throwing an exception means 
jumping to a continuation which is not the current one. 

How is it possible to offer region-based resource management and excep- 
tions in the same language and translate both to continuation-passing 
style? In this paper, we answer this question. We present a typed lan- 
guage with resources and exceptions, and its translation to continuation- 
passing style. The translation can be defined modularly for resources and 
exceptions — the correct interaction between the two automatically arises 
from simple composition. We prove that the translation preserves well- 
typedness and semantics. 


1 Introduction 


Regions were originally introduced for the safe and automatic management of 
memory [33]. Since then, much research extended their usefulness for memory 
management in different scenarios [9, 12-14]. Regions are also a useful tool for 
controlling the allocation, release, and use of any kind of scarce resource even 
when considering memory to be plentiful [19]. Resources are organized into a 
stack of regions which corresponds to nested scopes in the program. Resources 
in a region are automatically released when control flow leaves the corresponding 
scope. A type-and-region system guarantees resource safety, i.e., that there is 
no access to a resource outside of its corresponding scope. 

Exceptions allow for non-local exits from scopes. It is important that re- 
sources are released not only upon normal return, but also when an exception is 
thrown. A type-and-effect system statically ensures that certain error conditions 
do not occur when running a program. In the case of exceptions, for example, 
we want to guarantee exception safety, i.e., every exception is eventually caught. 
Some work on regions explicitly caters to exceptions [14, 18, 19, 32]. Still, the
interaction between regions, exceptions, and first-class functions is non-trivial. 
To the best of our knowledge region safety for a language with this combination 
of features has not yet been formally established. 

Continuation-passing style (CPS) is an attractive [2, 8, 17] intermediate rep- 
resentation for programs. Control flow is explicit and many program optimiza- 
tions amount to simple inlining and beta reduction. CPS can also be an imple- 
mentation technique for control effects like exceptions [16, 17, 26]. Optimization 
of programs using these features still amounts to inlining and reduction. In CPS 
all calls are tail calls. Importantly, there is no runtime stack that a thrown excep- 
tion unwinds. Instead, throwing an exception means jumping to a continuation 
other than the current one. 


A CPS translation (from a source to a target language in CPS) must of course 
be correct, i.e. preserve the semantics of the source language. Ideally, the target 
language is also typed, and the translation takes well-typed terms to well-typed 
terms. Moreover, when we translate a source program with exceptions to CPS, 
well-typedness of the target term should also entail exception safety. However, 
there is not yet a single CPS translation for both exceptions and resource man- 
agement in the same language. Moreover, since in CPS there is no stack, it is not 
possible to run cleanup actions during unwinding. Therefore it is not clear how 
such a combination in CPS could guarantee proper release of resources when an 
exception is thrown. 


We present an intermediate language Λρ with resources and exceptions. It
has a type-and-effect system keeping track of regions to model both the lifetime
of resources and the scope of exception handlers. We define its operational
semantics as an instrumented [23] abstract machine, which manipulates
a runtime stack. We prove progress (Theorem 1) and preservation (Theorem 2) 
for this semantics in the proof assistant Coq. Resource safety (Corollary 1) and 
exception safety (Corollary 2) follow as corollaries. To our knowledge, this is 
the first proof of safety for a language with region-based resource management, 
exceptions, and first-class functions. 

We define a CPS translation from Λρ to System F with base types and primitive
operations. The translation takes well-typed terms to well-typed terms
(Theorem 3). We implemented the translation as a shallow embedding into the
dependently typed language Idris 2. It does not use any special runtime constructs,
neither for regions nor for exceptions. The translation is correct: translated terms
simulate the abstract machine semantics step-wise (Theorem 4). This entails
resource safety and exception safety for CPS translated terms.

Our key technical idea is to understand regions as describing the runtime 
stack. In the operational semantics, language constructs for resources and ex- 
ceptions push freshly generated markers onto the runtime stack. At runtime, 
a region stands for the concrete list of markers on the stack and subregioning 
evidence stands for the concrete difference between two such lists. In CPS there 
is no stack. Under our CPS translation, regions are answer types [30], and subre- 
gioning evidence terms are answer-type coercing functions. They move from one 
region to another one. This allows us to define the CPS translations of resource
management and exceptions separately while having them interact correctly. 

The rest of the paper is organized as follows. In Section 2, we introduce
the main ideas behind our language Λρ. In Section 3, we formally present Λρ.
We start with a base language with type-level region tracking and term-level
subregioning evidence. We gradually extend this base language with region-based
resource management and exceptions. In Section 4, we define the CPS translation
from Λρ to System F. We do so gradually, first for the base language, then for
resources, then for exceptions. In Section 5 we compare to related work and in 
Section 6 we summarize the key ideas and outline future work. 


2 Overview 


Here, we provide an informal overview of our main ideas and the language Λρ.
We start by re-iterating how regions are used for resource management. We 
then introduce exceptions and show how we translate them to CPS. Finally, we 
combine resources and exceptions and demonstrate how our translation reveals 
information about the use of resources in the presence of non-local exits. 


2.1 Regions for Resources 


As a first example, let us see how regions can be used to manage file handles 
in Λρ. Our type system follows Fluet and Morrisett [12] and Kiselyov and Shan
[19] with some minor differences. 


Example 1. Consider the following simple example, which copies the first line of 
a file "input" into a file "output" and additionally inserts a line at the beginning 
and a line at the end of the output file. Both files are automatically closed and 
any attempt, accidental or not, to use them after they are closed will fail. 


pool { [r1](p1 : Pool r1, l1 : r1 ⊑ Top) ⇒
  val out: File r1 = open(p1, "output", 0);
  writeln(out, "start", 0);
  pool { [r2](p2 : Pool r2, l2 : r2 ⊑ r1) ⇒
    val in: File r2 = open(p2, "input", 0);
    val firstLine = readln(in, 0);
    writeln(out, firstLine, l2)
  };
  writeln(out, "end", 0);
  return ()
}

We use a pool { ... } statement to create a fresh resource pool. A pool is a
reference to a list of open files. All files in this list are automatically closed when
control flow leaves the enclosed block. The pool statement introduces a region
variable r1, a pool variable p1 and subregioning evidence l1. We then open
the file "output" in pool p1. In our type system, every statement is checked
in a region. The overall statement is checked in the top-level region Top. The
enclosed block is checked in region r1. When we open a file, we have to explicitly
pass evidence that the current region is a subregion of the pool's region. In this
example, we pass the reflexivity evidence 0 : r1 ⊑ r1. We create a second pool
p2 in a second region r2, which is clearly inside of r1. This fact is witnessed by
the evidence variable l2. When we write to the output file, we have to provide
evidence that the current region r2 is inside of the file's region r1. We provide
l2 : r2 ⊑ r1.


For this simple example, after applying our CPS translation and some beta 
reduction we get the following straight-line code. 


λk.
  let p1 = createPool ();
  let out = openFile p1 "output";
  writeLine out "start";
  let p2 = createPool ();
  let in = openFile p2 "input";
  let firstLine = readLine in;
  writeLine out firstLine;
  releasePool p2;
  writeLine out "end";
  releasePool p1;
  k ()


The original program did not contain any interesting control flow and our CPS
translation results in a sequence of primitive operations. There is no overhead 
for protecting resources when no exception is thrown. Later we will see how we 
clean up resources when there are exceptions. But first, let us look at our CPS 
translation of exceptions. 
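
As a rough illustration only (not the implementation described in this paper), the straight-line program above can be mimicked in Python. The functions create_pool, open_file, read_line, write_line, and release_pool are hypothetical in-memory stand-ins for the primitives; the point is merely that the translated program is a plain sequence of calls that ends by invoking the continuation k.

```python
# Hypothetical in-memory stand-ins for the primitives of the CPS output.
log = []  # records the order of pool and file operations
files = {"input": ["first line", "second line"], "output": []}

def create_pool():
    log.append("createPool")
    return {"open": [], "released": False}

def open_file(pool, name):
    assert not pool["released"], "pool already released"
    handle = {"name": name, "pos": 0, "closed": False}
    pool["open"].append(handle)
    log.append(f"openFile {name}")
    return handle

def read_line(handle):
    assert not handle["closed"], "file already closed"
    line = files[handle["name"]][handle["pos"]]
    handle["pos"] += 1
    return line

def write_line(handle, text):
    assert not handle["closed"], "file already closed"
    files[handle["name"]].append(text)

def release_pool(pool):
    # Releasing a pool closes every file that was opened in it.
    for handle in pool["open"]:
        handle["closed"] = True
    pool["released"] = True
    log.append("releasePool")

def program(k):
    # Mirrors the straight-line code: no handlers, only a final call to k.
    p1 = create_pool()
    out = open_file(p1, "output")
    write_line(out, "start")
    p2 = create_pool()
    inp = open_file(p2, "input")
    first_line = read_line(inp)
    write_line(out, first_line)
    release_pool(p2)
    write_line(out, "end")
    release_pool(p1)
    return k(())

result = program(lambda unit: unit)
```

In this sketch the inner pool is released before the final write to the output file, matching the nesting of the two pool statements in Example 1.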


2.2 Regions for Exception Handlers 


Exceptions abort the current computation to an exception handler. An exception 
that is thrown while the corresponding handler is not on the stack results in an 
error condition that we statically prevent from happening. In Λρ, we use the
same mechanism for resources and exceptions and enforce exception safety in 
terms of regions: in order to throw to an exception handler, we require evidence 
that the corresponding handler is still on the stack. 

Exceptions in Λρ are lexically scoped: the connection between a thrown
exception and its handler is established by a variable that stands for this very
handler [5, 6, 35, 36]. This style of exceptions is in contrast to traditional excep- 
tions, which are caught by the dynamically closest handler. Lexical exception 
handlers have advantages when reasoning about higher-order functions. Opera- 
tionally, each try statement generates a fresh marker at runtime and pushes a 
catch frame with this marker onto the stack. We explicitly pass these markers 
as values of type Catch r. For example, consider the following program. 
Example 2. The function safeDiv divides two numbers, but throws an exception
when the second number is zero. 


def safeDiv[r](x : Int, y : Int, e : Catch r) at r {
  if (y == 0) { throw(e, 0) }
  else { return (x / y) }
}


In addition to the two parameters x and y, the function safeDiv receives a catch 
marker e. When y is zero we throw to e. For this to be safe we need to guarantee 
that we only throw to e in the dynamic extent of the corresponding exception 
handler. But this is the very same problem we had with pools. So we use the 
very same solution: When we throw to a catch frame of type Catch r we have 
to provide evidence that the current region is a subregion of the catch’s region, 
in this example 0 : r ⊑ r.


The function safeDiv is region polymorphic. It abstracts over a region vari- 
able r. It is also annotated to run in the region r. To handle the exception we 
use our safeDiv function as follows. 


try { [r1](e1 : Catch r1, l1 : r1 ⊑ Top) ⇒
  safeDiv[r1](5, 0, e1)
} catch { return 0 }


Very much like the pool statement, the exception handler introduces a region 
variable r1, a handler e1, and subregioning evidence l1. In the call to safeDiv,
we instantiate the region variable r to r1 and pass the exception handler e1. 
The example illustrates that we can guarantee exception safety by the very same 
mechanism we use for region safety. 


When we translate this program to CPS, inline the function safeDiv, and
apply beta reduction and commuting conversions, we get the following:


λk. if (0 = 0) then k 0 else k (5 / 0)


When we translate programs to CPS, control flow becomes explicit. This is also 
true in the presence of control effects like exceptions. Because of this, optimizing 
programs in CPS amounts to beta reduction. How then can we achieve the same 
in the presence of resources and exceptions? 
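
The following Python sketch illustrates, under simplifying assumptions, how a lexically scoped handler can be passed around as an ordinary continuation. The names safe_div and try_safe_div are hypothetical, integer division stands in for the division of Example 2, and region parameters and evidence are omitted entirely.

```python
def safe_div(x, y, handler, k):
    # handler: continuation of the lexically enclosing catch clause
    # k: continuation for normal return
    if y == 0:
        return handler()      # throw: jump to a continuation other than k
    return k(x // y)          # integer division is an assumption of this sketch

def try_safe_div(x, y, k):
    # try { ... safeDiv(x, y, e1) } catch { return 0 }
    # The handler resumes the continuation of the whole try statement with 0.
    handler = lambda: k(0)
    return safe_div(x, y, handler, k)
```

Inlining safe_div into try_safe_div for the arguments 5 and 0 leaves a conditional whose two branches both call k, which is the shape of the reduced term shown above.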


2.3 Combining Resources and Exceptions 


Consider the following simple program that mixes pools and exceptions. 


Example 3. We install an exception handler and create two resource pools. We 
open a file in the inner pool, open a file in the outer pool, and then throw an 
exception. 
try { [r1](e1 : Catch r1, l1 : r1 ⊑ Top) ⇒
  pool { [r2](p2 : Pool r2, l2 : r2 ⊑ r1) ⇒
    pool { [r3](p3 : Pool r3, l3 : r3 ⊑ r2) ⇒
      open(p3, "input", 0);
      open(p2, "output", l3);
      throw(e1, l3 ⊕ l2)
    }
  }
} catch { return 1 }


To open files into pools, we have to provide evidence, as before. To throw an
exception to the outer handler e1, we provide evidence that region r3 is inside of
r1. We compose evidence variables l3 ⊕ l2 to get evidence of type r3 ⊑ r1.


This program, after CPS translation, reduces to the following program. The 
exception handler is known and will be eliminated. Again, simplifying control 
flow amounts to beta reduction as usual in CPS. 


λk.
  let p2 = createPool ();
  let p3 = createPool ();
  openFile p3 "input";
  openFile p2 "output";
  releasePool p3;
  releasePool p2;
  k 1


In our framework, these simplifications of control flow also correctly account for 
proper creation and release of resources. We can blindly reduce the translated 
program without any extra considerations. 
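
One way to picture why the releases appear in the right order is to read evidence as a function that coerces a continuation of an outer region into one of an inner region. The Python sketch below is a loose illustration of that reading and not the translation defined in Section 4; refl, compose, and pool_evidence are hypothetical names, and pools are represented only by their names.

```python
released = []  # order in which pools are released

def refl(k):
    # Reflexivity evidence 0: no region boundary is crossed.
    return k

def compose(inner, outer):
    # inner ⊕ outer: leave the inner region first, then the outer one.
    return lambda k: inner(outer(k))

def pool_evidence(name):
    # Evidence introduced by a pool statement (assumption of this sketch):
    # crossing the boundary of the pool's region releases the pool.
    def lift(k):
        def k_after_release(*args):
            released.append(name)
            return k(*args)
        return k_after_release
    return lift

def program(k):
    handler = lambda: k(1)        # catch { return 1 }
    l2 = pool_evidence("p2")      # l2 : r2 ⊑ r1
    l3 = pool_evidence("p3")      # l3 : r3 ⊑ r2
    # throw(e1, l3 ⊕ l2): coerce the handler into region r3, then jump to it.
    return compose(l3, l2)(handler)()

result = program(lambda v: v)
```

Here the composed evidence releases p3 before p2 and only then runs the handler, which corresponds to the order of the releasePool calls in the reduced program above.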


2.4 First-Class Functions 


The language Λρ supports first-class functions. For example, consider the
following program which factors out a common pattern as a higher-order function.


def withFile[r0](path: String, f: [r](File r, r ⊑ r0) →r Unit) at r0 {
  pool { [r1](p1: Pool r1, l1: r1 ⊑ r0) ⇒
    val file = open(p1, path, 0);
    f[r1](file, l1)
  }
}


The function withFile is region polymorphic. It abstracts over the region r0 it
can be used in. The function f must be region polymorphic too, because we use
it under a new region r1. We instantiate its region parameter with r1 and pass
evidence l1. It would be possible to write withFile with the following signature:


withFile : [r0](path: String, f: [r](File r) →r Unit) →r0 Unit

Here, the function parameter f would not receive any evidence. This variant of
withFile would be less useful, as f could not access any resources from outside
of the call-site of withFile.


3 A Language with Regions, Resources, and Exceptions 


In this section, we formally present Λρ and its operational semantics. We will
introduce Λρ step-by-step starting with a base language with support for
type-level region tracking but no interesting term-level features that make use of
them. We then add resource pools, exceptions, and finally consider the combination
of the two. The operational semantics is given in terms of an abstract machine that
manipulates a runtime stack. In Section 4, we present a CPS translation of Λρ,
following the same incremental development.

The paper is accompanied by a mechanized formalization of Λρ and its
operational semantics in the Coq theorem prover [3], including the usual theorems
of Progress (Theorem 1) and Preservation (Theorem 2). Resource- and excep- 
tion safety follow as corollaries: whenever we use a resource (like a file) it is live 
(Corollary 1), and whenever we throw an exception the corresponding handler 
is on the stack (Corollary 2). 

Our operational semantics will push freshly generated markers onto the run- 
time stack. A region is the list of concrete markers on the stack and evidence is 
the list of markers that is the difference between two such lists. Although they do 
not play any role computationally, for our proofs we will substitute these lists for 
region variables and evidence variables at runtime. Our typing rule for runtime 
evidence then makes proving region safety and exception safety possible. 


3.1 Syntax 


Figure 1 defines the syntax of the core of Λρ. We use fine-grain call-by-value [22]
and syntactically distinguish between statements, which can have effects, and 
pure expressions. 

Function values (i.e., { [r̄](x̄ : τ̄) at ρ ⇒ s }) abstract over a list of type-level
region parameters (i.e., r̄), and a list of term-level parameters (i.e., x̄ : τ̄). Each
function is defined to run exactly in a region ρ, but otherwise functions are
unsurprising. Since our focus is on the interaction between regions and control
effects, we omit type abstraction from this presentation. Our mechanization
includes type polymorphism, which is orthogonal to the rest of the calculus. We
define the following short-hand notation for named function definitions:


def f[r̄](x̄ : τ̄) at ρ { s0 }; s  ≐  val f = return { [r̄](x̄ : τ̄) at ρ ⇒ s0 }; s


The list of region parameters scopes over the parameter types, the return type,
the annotated region ρ, and the body s0 of the function. We apply functions to
a list of regions ρ̄ and a list of arguments ē.

We introduce two additional concepts: type-level regions and term-level
evidence. Type-level regions ρ are region variables r or the top-level region ⊤.

Terms:
  Statements
    s ::= val x = s; s                        sequencing
        | return e                            returning
        | e[ρ̄](ē)                             application
  Expressions
    e, i ::= x | f | l | ...                  variables
        | v                                   values
        | 0                                   reflexivity ev.
        | e ⊕ e                               transitivity ev.
  Values
    v ::= () | 0 | 1 | ... | true | ...       primitives
        | { [r̄](x̄ : τ̄) at ρ ⇒ s }             closures

Types:
  Types
    τ ::= Int | Bool | ...                    primitives
        | ∀[r̄](τ̄) →ρ τ                        functions
        | ρ ⊑ ρ                               evidence
  Regions
    ρ ::= r                                   region variable
        | ⊤                                   toplevel region

Environments:
    Γ ::= ∅                                   empty env.
        | Γ, r                                region binding
        | Γ, x : τ                            value binding

Names:
    x, y, f, l ::= x | y | f | l | ...        value variables
    r ::= r | s | ...                         region variables


Fig. 1. Syntax of the core of Λρ.


Intuitively, the top-level region denotes the bottom part of the runtime stack.
Term-level evidence expressions are either the empty evidence 0 witnessing
reflexivity of subregioning, or the composition of evidence e ⊕ e witnessing
transitivity of subregioning. By convention, we use the meta-variables f and l to
stand for variables of function type and evidence type respectively, and we use
the meta-variable i to stand for expressions of evidence type.


3.2 Typing 


Figure 2 defines the typing of core Λρ. We type statements and expressions with
different judgement forms. While both are typed in an environment Γ containing
value and region bindings, only statements are typed in a given region ρ.
Statements may perform effectful (that is, serious in the terminology of Reynolds
[24]) computation, which is only safe in specific regions. In contrast, expressions
are pure (that is, trivial) and can be evaluated independent of any region.


Typing of Statements Rule VAL types sequencing of statements. We type the two
statements s0 and s in the same region ρ as the compound statement. Returning
a result of a computation (rule RET) can be typed in any region. In rule APP,
we apply a function e0 to a list of regions ρ̄ and to a list of arguments ē. The type
of e0 is a function type in a region ρ0. The overall statement is typed in a region
ρ. The premise ρ = ρ0[r̄ ↦ ρ̄] requires that ρ and ρ0, after substituting the regions
ρ̄ for the region variables r̄, are syntactically the same. Note that we do not
have any implicit or explicit subtyping of function types here or elsewhere. All
region subtyping exclusively occurs through the passing of subregioning evidence.


500 Schuster, Brachthauser, and Ostermann 


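Statement Typing:  Γ ¦ ρ ⊢ s : τ

  Γ ⊢ e0 : ∀[r̄](τ̄) →ρ0 τ0    Γ ⊢ ē : τ̄[r̄ ↦ ρ̄]    ρ = ρ0[r̄ ↦ ρ̄]
  ──────────────────────────────────────────────────────────── [App]
  Γ ¦ ρ ⊢ e0[ρ̄](ē) : τ0[r̄ ↦ ρ̄]

  Γ ¦ ρ ⊢ s0 : τ0    Γ, x0 : τ0 ¦ ρ ⊢ s : τ
  ────────────────────────────────────────── [Val]
  Γ ¦ ρ ⊢ val x0 = s0; s : τ

  Γ ⊢ e : τ
  ────────────────────── [Ret]
  Γ ¦ ρ ⊢ return e : τ

Expression Typing:  Γ ⊢ e : τ

  Γ(x) = τ
  ────────── [Var]
  Γ ⊢ x : τ

  [Lit]  standard typing of primitive values

  Γ, r̄, x̄ : τ̄ ¦ ρ ⊢ s0 : τ0
  ──────────────────────────────────────────────── [Fun]
  Γ ⊢ { [r̄](x̄ : τ̄) at ρ ⇒ s0 } : ∀[r̄](τ̄) →ρ τ0

  ──────────────── [Reflexive]
  Γ ⊢ 0 : ρ ⊑ ρ

  Γ ⊢ e : ρ ⊑ ρ′    Γ ⊢ e′ : ρ′ ⊑ ρ″
  ──────────────────────────────────── [Transitive]
  Γ ⊢ e ⊕ e′ : ρ ⊑ ρ″


Fig. 2. Type system of the core of Λρ.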


Typing of Expressions The typing rules for variables VAR and primitives LIT
are standard. Rule FUN types functions. We type the body s0 of the function
in an environment extended with the region parameters r̄ and value parameters
x̄ : τ̄. Every function is annotated with a region ρ that specifies exactly
the region it will have to be called in. This region ρ is also the region we type
the body s0 in. The region parameters r̄ may appear in the parameter types, the
return type, the function's region ρ, and body s0. This allows us to write
region-polymorphic functions that can run in any region. Value parameters of
evidence type allow us to write region-polymorphic functions that are constrained
to run in a subregion that meets these constraints.

Reflexivity evidence 0 witnesses that every region is nested within itself, and
evidence e ⊕ e′ witnesses the transitivity of nesting, which is reflected in their
typing rules. We require the composition of evidence to be associative.
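
To make the two evidence rules concrete, here is a small Python sketch of a checker for evidence expressions. It is an illustration under assumptions and not part of the mechanization: Refl, Var, Comp, and type_of are hypothetical names, regions are plain strings, and the environment maps each evidence variable to its pair of inner and outer region.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Refl:
    """The reflexivity evidence 0."""

@dataclass(frozen=True)
class Var:
    """An evidence variable l."""
    name: str

@dataclass(frozen=True)
class Comp:
    """The composition e ⊕ e'."""
    left: object
    right: object

def type_of(ev, env, region):
    """Return the region rho' such that ev : region ⊑ rho', or None if ill-typed."""
    if isinstance(ev, Refl):
        return region                      # 0 : rho ⊑ rho
    if isinstance(ev, Var):
        inner, outer = env[ev.name]
        return outer if inner == region else None
    if isinstance(ev, Comp):
        middle = type_of(ev.left, env, region)
        if middle is None:
            return None
        return type_of(ev.right, env, middle)
    raise TypeError(f"not an evidence expression: {ev!r}")

# Evidence variables of Example 3.
env = {"l1": ("r1", "Top"), "l2": ("r2", "r1"), "l3": ("r3", "r2")}
```

With this environment, Comp(Var("l3"), Var("l2")) checked in region r3 yields r1, whereas composing the two variables in the opposite order is rejected because the regions do not line up.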


3.3 Operational Semantics 


Figure 3 presents the operational semantics of core Λρ. A machine state ⟨ s ∥ K ⟩
consists of the statement s under evaluation and the runtime stack K. For the
core of Λρ, the stack K is a list of frames of the form val x = □; s. The
reduction rules are mostly standard. The first rule (return) returns to the next
frame on the stack. The second rule (push) focuses on s0 and pushes a frame
on the stack. Finally, rule (call) performs reduction by simultaneously
substituting region arguments ρ̄ for region variables r̄ and trivial expressions ē
for term parameters x̄. Region parameters, the annotated region ρ, and evidence terms

Syntax of the Abstract Machine:

  Machine States   M ::= ⟨ s ∥ K ⟩
  Stacks           K ::= ε | F :: K
  Frames           F ::= val x = □; s

Machine Steps:

  (return)  ⟨ return e ∥ val x = □; s :: K ⟩           →  ⟨ s[x ↦ e] ∥ K ⟩
  (push)    ⟨ val x = s0; s ∥ K ⟩                      →  ⟨ s0 ∥ val x = □; s :: K ⟩
  (call)    ⟨ { [r̄](x̄ : τ̄) at ρ ⇒ s0 }[ρ̄](ē) ∥ K ⟩     →  ⟨ s0[r̄ ↦ ρ̄][x̄ ↦ ē] ∥ K ⟩

Extended Syntax: Runtime Regions and Evidence:

  v ::= ... | w     evidence value        w ::= ε     evidence values
  ρ ::= ... | u     runtime region        u ::= ε     runtime regions

Runtime Region of Stack:  R⟦ · ⟧ : K → u

  R⟦ ε ⟧                   = ε
  R⟦ val x = □; s :: K ⟧   = R⟦ K ⟧

Evaluation of Evidence:  V⟦ · ⟧ : e → w

  V⟦ 0 ⟧        = ε
  V⟦ e ⊕ e′ ⟧   = V⟦ e ⟧ ++ V⟦ e′ ⟧
  V⟦ w ⟧        = w


Fig. 3. Abstract machine semantics of core Λρ.


are operationally irrelevant. As already mentioned, we need them to maintain 
invariants in our proofs. 

The core of Λρ, as presented, does not yet contain features with interesting
operational behavior. While we can abstract over regions, eventually all region
variables will be instantiated with the top-level region and evidence will always
be the trivial evidence.

Figure 3 also defines runtime regions and evidence values in core Λρ. We
extend the syntax of values with evidence values w, and the syntax of regions
with runtime regions u. Both are empty lists ε for now. In the next two sections,
we will extend their syntax to be lists of markers h. The toplevel region ⊤ is
the empty runtime region ε.

To connect type-level regions ρ with the concrete runtime stack K, we define
a semantic function R⟦ · ⟧, which computes the runtime region of the current
stack. In core Λρ, the only possible runtime region is the empty list. To give
meaning to evidence expressions, we define a semantic function V⟦ · ⟧. Currently
the only possible evidence value is the empty list.
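
The (return) and (push) rules can be played through with a few lines of Python. This is a simplified sketch and not the Coq formalization: statements are encoded as tuples, only literals and variables occur as expressions, the (call) rule and all region information are left out, and the names subst, step, and run are hypothetical.

```python
# Statements:  ("val", x, s0, s)  |  ("return", e)
# Expressions: ("var", x)         |  ("lit", n)

def subst_expr(e, x, v):
    return v if e == ("var", x) else e

def subst(s, x, v):
    """Replace variable x by the closed expression v in statement s."""
    if s[0] == "return":
        return ("return", subst_expr(s[1], x, v))
    _, y, s0, body = s
    # A binding of y = x shadows x in the body.
    return ("val", y, subst(s0, x, v), body if y == x else subst(body, x, v))

def step(state):
    s, stack = state
    if s[0] == "val":                    # (push): focus on s0, remember the rest
        _, x, s0, body = s
        return (s0, [(x, body)] + stack)
    if s[0] == "return" and stack:       # (return): resume the topmost frame
        (x, body), rest = stack[0], stack[1:]
        return (subst(body, x, s[1]), rest)
    return None                          # final state: return with an empty stack

def run(s):
    state = (s, [])
    while (nxt := step(state)) is not None:
        state = nxt
    return state[0]

# val x = return 1; val y = return x; return y
prog = ("val", "x", ("return", ("lit", 1)),
        ("val", "y", ("return", ("var", "x")),
         ("return", ("var", "y"))))
```

Running prog alternates (push) and (return) steps until the stack is empty and the machine holds a return of the literal 1.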


3.4 Resource Pools 


In this subsection, we add statements for region-based resource management to 
Λρ. As in the introduction, we use files as an example for resources. Figure 4

Syntax:
  Statements
    s ::= ...
        | pool { [r](x, l) ⇒ s }        new resource pool
        | open(e, e0, i)                open file
        | readln(e, i)                  read contents

Typing Rules:

  Γ, r, x : Pool r, l : r ⊑ ρ ¦ r ⊢ s : τ
  ──────────────────────────────────────── [Pool]
  Γ ¦ ρ ⊢ pool { [r](x, l) ⇒ s } : τ

  Γ ⊢ e : Pool ρ′    Γ ⊢ e0 : String    Γ ⊢ i : ρ ⊑ ρ′
  ────────────────────────────────────────────────────── [Open]
  Γ ¦ ρ ⊢ open(e, e0, i) : File ρ′

  Γ ⊢ e : File ρ′    Γ ⊢ i : ρ ⊑ ρ′
  ─────────────────────────────────── [Read]
  Γ ¦ ρ ⊢ readln(e, i) : String


Fig. 4. Syntax and typing rules of resource pools.


Syntax of Frames:

  F ::= ... | #poolₕ { □ }        resource pool frame

Machine Steps:

  (release)  ⟨ return e ∥ #poolₕ { □ } :: K ⟩   →  ⟨ return e ∥ K ⟩
             do releasePool(h)

  (pool)     ⟨ pool { [r](x, l) ⇒ s0 } ∥ K ⟩    →  ⟨ s0[r ↦ u][x ↦ h][l ↦ w] ∥ #poolₕ { □ } :: K ⟩
             do h = createPool()
             where u = poₕ :: R⟦K⟧, and w = poₕ :: ε

  (open)     ⟨ open(h, e, i) ∥ K ⟩              →  ⟨ return x ∥ K ⟩
             do x = openFile(h, e)
             when poₕ in R⟦K⟧

  (read)     ⟨ readln(p, i) ∥ K ⟩               →  ⟨ return x ∥ K ⟩
             do x = readLine(p) where h = p.getPool
             when poₕ in R⟦K⟧

Runtime Regions and Evidence:

  h ::= @adf | @4b2 | ...         markers
  w ::= ... | poₕ :: w            evidence value
  u ::= ... | poₕ :: u            runtime region

Runtime Region of Stack:

  R⟦ #poolₕ { □ } :: K ⟧ = poₕ :: R⟦K⟧


Fig. 5. Abstract machine semantics of pool-based resource management.

Syntax:
  Statements                                                Types
    s ::= ...                                                 τ ::= ...
        | try { [r](x, l) ⇒ s0 } catch { s }    handling          | Catch ρ
        | throw(e, i)                           throwing

Typing Rules:

  Γ, r, x : Catch r, l : r ⊑ ρ ¦ r ⊢ s0 : τ    Γ ¦ ρ ⊢ s : τ
  ──────────────────────────────────────────────────────────── [Try]
  Γ ¦ ρ ⊢ try { [r](x, l) ⇒ s0 } catch { s } : τ

  Γ ⊢ e : Catch ρ′    Γ ⊢ i : ρ ⊑ ρ′
  ──────────────────────────────────── [Throw]
  Γ ¦ ρ ⊢ throw(e, i) : τ


Fig. 6. Syntax and typing rules of exceptions.


introduces three additional statement forms, which introduce and eliminate
non-trivial evidence to assert that all files are correctly closed. The pool statement
delimits a new region in which we run the enclosed statement s. It introduces
three variables, a fresh region variable r, a variable x : Pool r, and evidence
l : r ⊑ ρ, witnessing that the fresh region r is a subregion of the outer region ρ.
The open statement receives a pool argument e, a filename e0, and an evidence
argument i : ρ ⊑ ρ′ that witnesses that the current region ρ is nested within
the pool's region ρ′. Rule READ for readln statements is similar.

Figure 5 extends the operational semantics. Frames can now be pool frames 
which contain a marker h. In rule (pool), we allocate a fresh marker h and push 
a pool frame onto the stack. In rule (release), we pop the pool frame and release 
the pool h, closing all associated resources. Our goal is to ensure that all access 
to marker h happens between these two steps. 

To this end, rules (open) and (read) dynamically assert that the marker h 
is on the current stack K. Accessing a pool that fails this test would result in 
a stuck term. As it turns out, the mere existence of evidence i suffices to show 
that the assertion always succeeds (Corollary 1). 

For our proof of this fact, Figure 5 extends the syntax of runtime regions
and evidence. Runtime regions now include lists of pool markers and so do
evidence values. The runtime region of a stack K is the list of markers that
have been pushed onto it. We extend the function R⟦ · ⟧ to extract this list.
During execution, region variables r stand for runtime regions u. In rule (pool)
we substitute the runtime region poₕ :: R⟦K⟧ for the region variable r and
the singleton list poₕ :: ε for the evidence variable l. Later we will see how
the typing rule for evidence values connects type-level runtime regions with the
concrete runtime region of the current stack K.


3.5 Exceptions 


Figure 6 extends $\Lambda_\rho$ with two new statement forms. The try ... catch ... statement
delimits a new region in which we run the enclosed statement $s_0$. It introduces
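Syntax of Frames:

$F ::= \ldots \mid \#\mathsf{catch}_h\{\square\}\{s\}$   (catch frame)

Machine Steps:

(popcatch)  $\langle \mathbf{return}\ e \parallel \#\mathsf{catch}_h\{\square\}\{s\} :: K \rangle \to \langle \mathbf{return}\ e \parallel K \rangle$

(try)  $\langle \mathbf{try}\ \{\, [r](x, l) \Rightarrow s_0 \,\}\ \mathbf{catch}\ \{\, s \,\} \parallel K \rangle \to \langle s_0[r \mapsto u][x \mapsto h][l \mapsto w] \parallel \#\mathsf{catch}_h\{\square\}\{s\} :: K \rangle$
    do $h = \mathsf{generateFresh}()$ where $u = \mathsf{ca}\,h :: \mathcal{R}[\![K]\!]$ and $w = \mathsf{ca}\,h :: \bullet$

(throw)  $\langle \mathbf{throw}(h, i) \parallel K \rangle \to \langle \mathbf{throw}(h, \mathcal{V}[\![i]\!]) \parallel K \rangle$

(unwind)  $\langle \mathbf{throw}(h, w) \parallel \mathbf{val}\ x = \square;\ s :: K \rangle \to \langle \mathbf{throw}(h, w) \parallel K \rangle$

(forward)  $\langle \mathbf{throw}(h, \mathsf{ca}\,h' :: w) \parallel \#\mathsf{catch}_{h'}\{\square\}\{s\} :: K \rangle \to \langle \mathbf{throw}(h, w) \parallel K \rangle$   where $h \neq h'$

(catch)  $\langle \mathbf{throw}(h, \bullet) \parallel \#\mathsf{catch}_h\{\square\}\{s\} :: K \rangle \to \langle s \parallel K \rangle$

Runtime Regions and Evidence:

$w ::= \ldots \mid \mathsf{ca}\,h :: w$   (evidence value)
$u ::= \ldots \mid \mathsf{ca}\,h :: u$   (runtime region)

Runtime Region of Stack:

$\mathcal{R}[\![\#\mathsf{catch}_h\{\square\}\{s\} :: K]\!] = \mathsf{ca}\,h :: \mathcal{R}[\![K]\!]$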


Fig. 7. Abstract machine semantics of exceptions. 


three variables: a fresh region variable $r$, a variable $x : \mathsf{Catch}\ r$, and an evidence
variable $l : r \sqsubseteq \rho$, witnessing that the fresh region $r$ is a subregion of the outer
region $\rho$. The throw statement receives a handler $e$ to throw to, and evidence
that the handler's region $\rho'$ is nested in the current region $\rho$.


Figure 7 extends the operational semantics. Frames can now be catch frames 
with a marker h and a catch statement s. In rule (try) we generate a fresh 
marker h and push a catch frame with this marker and the catch statement onto 
the stack. The handler x is this marker h. In rule (popcatch) we pop this catch 
frame upon normal return. In rule (throw) we transition from normal execution 
to unwinding. Here $h$ is a catch marker, and $\mathcal{V}[\![i]\!]$ evaluates the evidence expression $i$
to a list of catch markers. In rules (unwind) and (forward) we unwind the stack 
frame-by-frame until we find the matching catch frame (catch). Because each try 
Extended Machine Steps:


(free)  $\langle \mathbf{throw}(h, \mathsf{po}\,h' :: w) \parallel \#\mathsf{pool}_{h'}\{\square\} :: K \rangle \to \langle \mathbf{throw}(h, w) \parallel K \rangle$
    do $\mathsf{releasePool}(h')$


Fig. 8. Abstract machine semantics of combining resources and exceptions. 


statement generates a fresh marker at runtime, and we search for this marker 
during unwinding, exceptions have generative semantics [5, 6, 35, 36]. 

Figure 7 extends the syntax of runtime regions and evidence. They now in- 
clude lists of catch markers. Again, evidence guarantees that unwinding never 
fails, i.e. the corresponding marker is always somewhere on the stack. Remark- 
ably, we pop elements off the evidence value w in lock-step with popping catch 
frames off the stack and never get stuck in doing so. We always find the match- 
ing catch frame exactly when the evidence value is the empty list. The evidence 
value precisely reflects the list of markers between the region of the throw state- 
ment and the region of the catch statement. Importantly, this also holds for the 
combined language $\Lambda_\rho$ (Corollary 4).
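The lock-step unwinding described above can be sketched in a few lines of Python. This is only an illustration, not the paper's mechanization: the tuple encoding of frames, the function name `unwind`, and the use of strings as markers are assumptions made for this example.

```python
def unwind(handler, evidence, stack):
    """Unwind `stack` (top frame first) for a throw to marker `handler`.

    `evidence` is the list of catch markers between the throw and the
    handler. Returns (handler_body, remaining_stack) or raises if stuck.
    """
    evidence = list(evidence)
    stack = list(stack)
    while stack:
        frame = stack.pop(0)
        if frame[0] == "val":
            continue                      # (unwind): drop an ordinary frame
        _, marker, body = frame           # a catch frame ("catch", h, s)
        if evidence and evidence[0] == marker and marker != handler:
            evidence.pop(0)               # (forward): pop evidence and frame together
        elif not evidence and marker == handler:
            return body, stack            # (catch): evidence is empty at the handler
        else:
            raise RuntimeError("stuck: evidence and stack disagree")
    raise RuntimeError("stuck: handler is not on the stack")


# Throw to the outer handler h1 from below the inner handler h2.
stack = [("val", "x", "s1"), ("catch", "h2", "return 2"),
         ("val", "y", "s2"), ("catch", "h1", "return 1")]
result = unwind("h1", ["h2"], stack)
```

With well-formed evidence the two error branches are never reached, which is the informal content of the safety argument; with ill-formed evidence (for example an empty list when throwing to `h1`) the sketch gets stuck.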


3.6 Combining Resource Pools and Exceptions 


When we extend the core language with both pools and exceptions, we notice 
that the machine gets stuck when we would have to unwind through a pool 
frame. Figure 8 extends the reduction relation with this missing case. When we
unwind through a $\#\mathsf{pool}_{h'}$ frame, we release the pool $h'$. In full $\Lambda_\rho$, regions are
lists whose elements are either pool markers or exception markers. Evidence
is, again, the same. Having to add the rule in Figure 8 shows that under
our operational semantics, the two extensions are not orthogonal: we have to
explicitly consider their interaction. In Section 4, we define a CPS translation for
$\Lambda_\rho$. Remarkably, both extensions can be defined separately and the correct
interaction automatically arises from their composition. Perhaps more importantly,
the resulting terms in CPS can be reduced freely without having to consider the 
interaction between pools and exceptions. 


3.7 Metatheory of $\Lambda_\rho$


We started out with core $\Lambda_\rho$, supporting only regions and subregioning evidence.
We then added two extensions, pools and exceptions, first individually, then 
together to arrive at the full language. Although we use resource pools for 
files as an example, our approach generalizes to region-based management of 
any resource. Indeed, in our mechanization, we do not model files and the pool 
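statement only pushes and pops the fresh marker. Instead of open and readln
we have a statement check with the following typing rule:


$$\frac{\Gamma \vdash e : \mathsf{Pool}\ \rho' \qquad \Gamma \vdash i : \rho \sqsubseteq \rho' \qquad \Gamma \mid \rho \vdash s : \tau}{\Gamma \mid \rho \vdash \mathbf{check}(e, i);\ s : \tau}\ [\textsc{Check}]$$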
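Stack Typing: $\vdash K : \tau$

$$\frac{}{\vdash \bullet : \tau}\ [\textsc{Empty}] \qquad \frac{x : \tau \mid \mathcal{R}[\![K]\!] \vdash s : \tau' \qquad \vdash K : \tau'}{\vdash \mathbf{val}\ x = \square;\ s :: K : \tau}\ [\textsc{Frame}]$$

$$\frac{\vdash K : \tau}{\vdash \#\mathsf{pool}_h\{\square\} :: K : \tau}\ [\textsc{Pool}] \qquad \frac{\emptyset \mid \mathcal{R}[\![K]\!] \vdash s : \tau \qquad \vdash K : \tau}{\vdash \#\mathsf{catch}_h\{\square\}\{s\} :: K : \tau}\ [\textsc{Catch}]$$

Abstract Machine Typing:

$$\frac{\emptyset \mid \mathcal{R}[\![K]\!] \vdash s : \tau \qquad \vdash K : \tau}{\vdash \langle s \parallel K \rangle\ \mathsf{ok}}\ [\textsc{Machine}]$$

Evidence Value Typing:

$$\frac{u_0 = w \mathbin{+\!+} u_1}{\emptyset \vdash w : u_0 \sqsubseteq u_1}\ [\textsc{Evidence}]$$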


Fig. 9. Abstract machine typing of $\Lambda_\rho$.


It asserts that the given pool is on the current runtime stack, i.e. live, and crashes 
the program if it is not. Otherwise it continues to execute statement s. We can 
safely access resources by first performing a runtime check and then using unsafe 
primitive operations. For example we would define 


$\mathbf{open}(e, e_0, i) := \mathbf{check}(e, i);\ \mathsf{openFile}(e, e_0)$


As we will see shortly, this check never fails. 
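The pattern of guarding an unsafe primitive with a liveness check can be sketched as follows. The names `live_pools`, `open_file_unsafe`, and `open_checked` are hypothetical stand-ins chosen for this illustration; they do not come from the mechanization.

```python
live_pools = []   # markers of the pool frames currently on the (simulated) stack


def open_file_unsafe(pool, name):
    # Stand-in for the unsafe primitive openFile; it performs no check itself.
    return (pool, name)


def check(pool):
    # The check statement: crash unless the pool marker is on the stack.
    if pool not in live_pools:
        raise RuntimeError("pool %r is not live" % (pool,))


def open_checked(pool, name):
    # open(e, e0, i) := check(e, i); openFile(e, e0)
    check(pool)
    return open_file_unsafe(pool, name)


live_pools.append("h1")
handle = open_checked("h1", "input")
```

In a well-typed program the `RuntimeError` branch is unreachable, so the check could be dropped without changing behavior.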


Soundness. We mechanized the formalization of $\Lambda_\rho$ in the Coq theorem prover
and showed the usual theorems of progress and preservation of the stepping 
relation on machine states M. 


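Theorem 1 (Progress).


If $\vdash M\ \mathsf{ok}$, then either $M \to M'$ or $M$ is of the form $\langle \mathbf{return}\ e \parallel \bullet \rangle$ for some
expression $e$.


Theorem 2 (Preservation).


If $\vdash M\ \mathsf{ok}$ and $M \to M'$, then $\vdash M'\ \mathsf{ok}$.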


Figure 9 presents the typing rules for the abstract machine. An abstract machine 
state is well-typed when the statement s is well-typed in the concrete runtime 
region of the stack $K$. The typing judgement $\vdash K : \tau$ types stacks $K$ that expect a
value of type $\tau$. An evidence value is well-typed when it is the difference between
the two runtime regions $u_0$ and $u_1$.


Properties The following properties follow directly from progress and preser- 
vation. Firstly, whenever we use a pool, it is live. The operational semantics 
inspects the runtime stack. But since the check always succeeds we do not have 
to actually perform it. 
Corollary 1 (Resource Safety).


If $\vdash \langle \mathbf{open}(h, e_0, i) \parallel K \rangle\ \mathsf{ok}$, then $\mathsf{po}\,h$ is in $\mathcal{R}[\![K]\!]$.


Secondly, whenever we throw an exception, the corresponding handler is on the 
stack. Moreover, as we have seen from the operational semantics, during the 
search for the correct handler, we encounter precisely the markers that are in 
the evidence value. 


Corollary 2 (Effect Safety). 


If $\vdash \langle \mathbf{throw}(h, i) \parallel K \rangle\ \mathsf{ok}$, then $\mathsf{ca}\,h$ is in $\mathcal{R}[\![K]\!]$.


Thirdly, every function runs in exactly the runtime region its type requires. 
In other words, the type-level region p will at runtime stand for the concrete 
runtime region of the stack this function is called in. 


Corollary 3 (Region Correspondence). 


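If $\vdash \langle \{\, [\overline{r}](\overline{x : \tau})\ \mathbf{at}\ \rho \Rightarrow s_0 \,\}[\overline{u}](\overline{e}) \parallel K \rangle\ \mathsf{ok}$, then $\rho[\overline{r} \mapsto \overline{u}] = \mathcal{R}[\![K]\!]$.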


Finally, evidence values are exactly the difference between the two regions. This 
corollary is inspired by the similarly named theorem of Xie et al. [34]. 


Corollary 4 (Evidence Correspondence). 


If an evidence value $w$ has type $\rho_0 \sqsubseteq \rho_1$, then $\rho_0$ and $\rho_1$ are runtime regions $u_0$
and $u_1$, and $u_0 = w \mathbin{+\!+} u_1$.
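To make Corollary 4 concrete, the following sketch models runtime regions as Python lists of markers (innermost first) and recovers the evidence value as the list difference. The function name `evidence_between` and the string markers are assumptions of this illustration.

```python
def evidence_between(inner, outer):
    """Return w such that inner == w + outer, or None if no such w exists."""
    n = len(inner) - len(outer)
    if n < 0 or inner[n:] != outer:
        return None          # outer is not a suffix of inner: no subregion evidence
    return inner[:n]


u1 = ["ca h1"]                          # region of the handler
u0 = ["ca h3", "po h2", "ca h1"]        # region of the throw statement
w = evidence_between(u0, u1)
```

Here `w` lists exactly the markers that unwinding has to pass before it reaches the handler's region, and reflexive evidence corresponds to the empty list.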


Together, these corollaries make runtime evidence on the one hand and marker 
frames on the stack on the other hand redundant. The unwinding can either 
use evidence terms, or markers on the stack, since the two agree. The opera- 
tional semantics uses both to establish this fact. The liveness check for pools is 
redundant since it always succeeds. It only exists to establish this fact. 

We could erase evidence terms and rely only on marker frames on the stack. In
the next section, we translate to CPS, where there is no stack. Therefore we will
do the opposite: erase marker frames and rely purely on evidence terms to have
the correct content at runtime. This is possible because of the correspondence
between evidence and runtime regions. Ultimately, this allows us to prove that
CPS-translated terms behave exactly as the operational semantics prescribes (Theorem 4).


4 Translation of Regions, Pools, and Exceptions to CPS 


We now present the translation of $\Lambda_\rho$ into System F (with file primitives) in CPS.
As a result of the translation, the stack $K$ becomes an evaluation context [10],
regions become answer types, and evidence terms become answer-type coercions.
As before, we will define the translations of core $\Lambda_\rho$ and the two extensions with
file pools and exceptions step-by-step. Our translation can serve as a compila- 
tion technique for languages with control effects and resources into any language 
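Translation of Types:

$\mathcal{T}[\![\mathsf{Int}]\!] = \mathsf{Int}$
$\mathcal{T}[\![r]\!] = r$
$\mathcal{T}[\![\top]\!] = \mathsf{Void}$
$\mathcal{T}[\![\forall[\overline{r}](\overline{\tau}) \to_{\rho} \tau_0]\!] = \forall \overline{r}.\ \mathcal{T}[\![\overline{\tau}]\!] \to \mathsf{Cps}\ \mathcal{T}[\![\rho]\!]\ \mathcal{T}[\![\tau_0]\!]$
$\mathcal{T}[\![\rho \sqsubseteq \rho']\!] = \forall a.\ \mathsf{Cps}\ \mathcal{T}[\![\rho']\!]\ a \to \mathsf{Cps}\ \mathcal{T}[\![\rho]\!]\ a$

Translation of Expressions:

$\mathcal{E}[\![x]\!] = x$
$\mathcal{E}[\![\{\, [\overline{r}](\overline{x : \tau})\ \mathbf{at}\ \rho \Rightarrow s \,\}]\!] = \Lambda \overline{r}.\ \lambda \overline{x}.\ \mathcal{S}[\![s]\!]_\rho$
$\mathcal{E}[\![0]\!] = \Lambda a.\ \lambda m.\ m$
$\mathcal{E}[\![e_1 \oplus e_2]\!] = \Lambda a.\ \lambda m.\ \mathcal{E}[\![e_1]\!]\ a\ (\mathcal{E}[\![e_2]\!]\ a\ m)$

Translation of Statements:

$\mathcal{S}[\![\mathbf{val}\ x = s_0;\ s_1]\!]_\rho = \lambda k.\ \mathcal{S}[\![s_0]\!]_\rho\ (\lambda x.\ \mathcal{S}[\![s_1]\!]_\rho\ k)$
$\mathcal{S}[\![\mathbf{return}\ e]\!]_\rho = \lambda k.\ k\ (\mathcal{E}[\![e]\!])$
$\mathcal{S}[\![e[\overline{\rho}](\overline{e})]\!]_\rho = \mathcal{E}[\![e]\!]\ \mathcal{T}[\![\overline{\rho}]\!]\ \mathcal{E}[\![\overline{e}]\!]$

Auxiliary Definitions:

$\mathsf{Cps}\ R\ A = (A \to R) \to R$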


Fig. 10. Translation from core $\Lambda_\rho$ to System F.


that supports first-class functions, making it widely applicable. Moreover, as 
demonstrated by Schuster et al. [26], modeling control effects with CPS can en- 
able compile-time optimizations for significant performance improvements. We 
implemented the CPS translation of $\Lambda_\rho$ as a shallow embedding in Idris 2 [7].


4.1 Translation of Core $\Lambda_\rho$


Figure 10 defines the translation of core $\Lambda_\rho$ to System F. Our translation targets
one particular variant of CPS, called iterated CPS [11, 25]. Every stack segment, 
delimited by a marker, is represented by its own continuation argument. That is, 
in iterated CPS, functions do not receive one but potentially multiple continua- 
tions. This will only become relevant in the presence of exceptions (Section 4.3). 


Translation of Types. Base types, such as Int, are left unchanged by the translation.
We translate region variables to type variables in System F and the toplevel
region to the empty type Void. The translation on types shows that the iterated
CPS translation is (so far) very similar to the traditional CPS translation. In
particular, the auxiliary meta-definition $\mathsf{Cps}\ R\ A$ is defined as the familiar type
$(A \to R) \to R$ of computations in CPS with return type $A$ and answer type $R$.
Evidence terms are functions between effectful computations, as can be seen
from the translation of evidence types.


Translation of Terms. As usual in CPS, we translate sequencing of statements
to push a frame onto the current continuation $k$, that is, the continuation first
runs $s_1$ and then continues with $k$. Return statements are translated to tail
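Extended Translation Rules:

$\mathcal{T}[\![\mathsf{Pool}\ \rho]\!] = \mathsf{PrimPool}$
$\mathcal{T}[\![\mathsf{File}\ \rho]\!] = \mathsf{PrimFile}$

$\mathcal{S}[\![\mathbf{pool}\ \{\, [r](x, l) \Rightarrow s_0 \,\}]\!]_\rho = \mathsf{RunPool}\ (\lambda h.\ (\Lambda r.\ \lambda x.\ \lambda l.\ \mathcal{S}[\![s_0]\!]_r)\ (\mathcal{T}[\![\rho]\!])\ h\ (\mathsf{LiftPool}\ h))$

$\mathcal{S}[\![\mathbf{open}(e, e_0, i)]\!]_\rho = \lambda k.\ k\ (\mathsf{openFile}\ \mathcal{E}[\![e]\!]\ \mathcal{E}[\![e_0]\!])$

$\mathcal{S}[\![\mathbf{readln}(e, i)]\!]_\rho = \lambda k.\ k\ (\mathsf{readLine}\ \mathcal{E}[\![e]\!])$

Auxiliary Definitions:

$\mathsf{RunPool} : (\mathsf{PrimPool} \to \mathsf{Cps}\ R\ A) \to \mathsf{Cps}\ R\ A$
$\mathsf{RunPool} = \lambda m.\ \lambda k.\ \mathbf{let}\ h = \mathsf{createPool}\ ();\ m\ h\ (\lambda x.\ \mathsf{releasePool}\ h;\ k\ x)$

$\mathsf{LiftPool}\ h : \forall a.\ \mathsf{Cps}\ R\ a \to \mathsf{Cps}\ R\ a$
$\mathsf{LiftPool}\ h = \Lambda a.\ \lambda m.\ \lambda k.\ \mathsf{releasePool}\ h;\ m\ k$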


Fig. 11. Translation of $\Lambda_\rho$ with resource pools.


calls of the current continuation. Again, viewing continuations as stacks, this is 
in accordance with the operational semantics given in Section 3.3. In general, 
statements with return type $\tau$ that have to be run in a region $\rho$ are translated
to terms of type $\mathsf{Cps}\ \mathcal{T}[\![\rho]\!]\ \mathcal{T}[\![\tau]\!]$. This can for instance be seen in the translation
of function types. We translate regions to answer types. Region abstractions 
are translated to type abstractions and region-polymorphic functions have a 
polymorphic answer type [30]. We translate evidence expressions to functions 
that lift a computation to run in a different region. The reflexivity evidence 
is translated to the polymorphic identity function, and transitivity of evidence 
amounts to function composition. 
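As an untyped illustration of the core translation, the following Python sketch encodes statements as functions that take a continuation. The helper names `ret`, `bind`, `refl`, and `trans` are chosen for this example only; types, and hence answer-type polymorphism, are not modeled.

```python
def ret(value):
    # S[return e] = λk. k e
    return lambda k: k(value)


def bind(first, rest):
    # S[val x = s0; s1] = λk. S[s0] (λx. S[s1] k)
    return lambda k: first(lambda x: rest(x)(k))


def refl(m):
    # Reflexivity evidence: the identity on computations.
    return m


def trans(i1, i2):
    # Transitivity evidence: function composition, λm. i1 (i2 m).
    return lambda m: i1(i2(m))


# val x = return 20; val y = return 22; return x + y
program = bind(ret(20), lambda x: bind(ret(22), lambda y: ret(x + y)))
answer = program(lambda v: v)
```

Sequencing pushes a frame onto the continuation, and returning calls it; composing reflexive evidence with itself leaves a computation unchanged.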

In the remainder of this section, we present the rest of the translation of our
language $\Lambda_\rho$ with pools and exceptions. Later, we show that the translated code
in CPS simulates the operational semantics given in Section 3.


4.2 Resource Pools 


In Figure 4, we have seen the definition of $\Lambda_\rho$ with resource pools. Figure 11
defines the translation to CPS. As we have seen in Section 3.7, we do not need 
any runtime checks to prevent markers and files from being used outside of their 
region. Indeed, in CPS there is no stack that we could check for markers.

The pool statement creates a fresh resource pool. The translation instantiates
$r$ with the outer answer type $\mathcal{T}[\![\rho]\!]$. When control leaves the enclosed block, the
pool is released. In its translation we use the auxiliary function RunPool. It
binds the current continuation $k$ and creates a fresh pool $h$. We run the given
computation $m$ with $h$ and a continuation where we push a frame that releases
the pool onto the current continuation $k$. This ensures that we release the pool
when we return normally from the enclosed block.
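Extended Translation Rules:

$\mathcal{T}[\![\mathsf{Catch}\ \rho]\!] = \mathsf{Cps}\ \mathcal{T}[\![\rho]\!]\ \mathsf{Void}$

$\mathcal{S}[\![\mathbf{try}\ \{\, [r](x, l) \Rightarrow s_0 \,\}\ \mathbf{catch}\ \{\, s \,\}]\!]_\rho = \mathsf{RunCps}\ ((\Lambda r.\ \lambda x.\ \lambda l.\ \mathcal{S}[\![s_0]\!]_r)\ (\mathsf{Cps}\ \mathcal{T}[\![\rho]\!]\ \mathcal{T}[\![\tau]\!])\ (\lambda k.\ \mathcal{S}[\![s]\!]_\rho)\ (\mathsf{LiftCps}))$

$\mathcal{S}[\![\mathbf{throw}(e, i)]\!]_\rho = \mathcal{E}[\![i]\!]\ \mathsf{Void}\ \mathcal{E}[\![e]\!]$

Auxiliary Definitions:

$\mathsf{RunCps} : \mathsf{Cps}\ (\mathsf{Cps}\ R\ A)\ A \to \mathsf{Cps}\ R\ A$
$\mathsf{RunCps} = \lambda m.\ m\ (\lambda x.\ \lambda k.\ k\ x)$

$\mathsf{LiftCps} : \forall a.\ \mathsf{Cps}\ R\ a \to \mathsf{Cps}\ (\mathsf{Cps}\ R\ R')\ a$
$\mathsf{LiftCps} = \Lambda a.\ \lambda m.\ \lambda k.\ \lambda j.\ m\ (\lambda x.\ k\ x\ j)$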


Fig. 12. Translation of $\Lambda_\rho$ with exceptions.


Evidence terms are functions LiftPool $h$ that release the pool $h$. Our types
make sure that we evaluate the evidence if and only if we non-locally leave the
body of the pool. In Section 3.4, evidence was a list of pools. Here, evidence still
contains a list of pools, but this list is hidden in the closure environment of the
evidence. Evidence composition conceptually concatenates these lists.

The open statement opens a file and registers it in the pool. The readln
statement uses a primitive to read from a file. We require evidence that the pool
is live, i.e. on the runtime stack, but do not have to actually use it. As we have
seen in Section 3.7, its existence is enough to assert that accessing the file is safe.


Example 4. Let us consider a simplified version of the motivating example (Sec- 
tion 2.1). The example on the left translates to the term in System F on the 
right. It has type $\mathsf{Cps}\ \mathsf{Void}\ \mathsf{Int}$.


pool {                                        λk.
  [r1](p1 : Pool r1, l1 : r1 ⊑ ⊤) ⇒             let h = createPool ();
  val f = open(p1, "input", 0);                 (Λr1. λp1. λl1. λk1.
  return 0                                        let f = openFile p1 "input";
}                                                 k1 0)
                                                Void h
                                                (Λa. λm. λk. releasePool h; m k)
                                                (λx. releasePool h; k x)


This term can be normalized to the following: 


λk. let h = createPool (); let f = openFile h "input"; releasePool h; k 0
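The translation of Example 4 can be replayed in untyped Python. The logging primitives `create_pool`, `open_file`, and `release_pool` are stand-ins invented for this sketch so that the order of effects becomes observable; `run_pool` and `lift_pool` follow the auxiliary definitions of Figure 11 with type abstraction erased.

```python
import itertools

log = []
_fresh = itertools.count(1)


def create_pool():
    h = "h%d" % next(_fresh)
    log.append(("createPool", h))
    return h


def release_pool(h):
    log.append(("releasePool", h))


def open_file(h, name):
    log.append(("openFile", h, name))
    return (h, name)


def run_pool(m):
    # RunPool = λm. λk. let h = createPool (); m h (λx. releasePool h; k x)
    def run(k):
        h = create_pool()

        def on_return(x):
            release_pool(h)
            return k(x)
        return m(h)(on_return)
    return run


def lift_pool(h):
    # LiftPool h = Λa. λm. λk. releasePool h; m k
    def lift(m):
        def lifted(k):
            release_pool(h)
            return m(k)
        return lifted
    return lift


def body(p1, l1):
    # val f = open(p1, "input", 0); return 0
    def stmt(k1):
        f = open_file(p1, "input")
        return k1(0)
    return stmt


example4 = run_pool(lambda h: body(h, lift_pool(h)))
result = example4(lambda x: x)
```

Running the term performs the same effects, in the same order, as the normalized term: create the pool, open the file, release the pool, and pass 0 to the continuation. The evidence `l1` is never invoked because the body returns normally.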


4.3 Exceptions 


In this subsection, we present the translation of exceptions. Whereas in the 
operational semantics (Section 3.5) we have divided the stack into regions with 
markers, we now have multiple stacks, i.e. continuations. We have seen that 
evidence terms contained exactly the list of markers we have to unwind when we
throw to a handler. Now we take advantage of this fact and let the evidence be 
the unwinding action itself. Figure 12 presents the translation of exceptions. It is 
different from the translation to double-barrelled CPS [17, 29], where functions 
only ever get exactly two continuations. Under our translation to iterated CPS 
functions can receive any number of continuations. 

To support aborting the computation, we instantiate the answer type $r$ of
the translated body $s_0$ to be the type $\mathsf{Cps}\ \mathcal{T}[\![\rho]\!]\ \mathcal{T}[\![\tau]\!]$. This adds another layer
of CPS and one additional (curried) continuation argument. In the translation
of try ... catch ... statements, we use RunCps. It runs the given computation $m$
with an additional continuation which is initially empty. The evidence $l$ lifts the
given computation from the inner region to the outer region. It will be bound
to LiftCps, which pushes the current continuation onto the next one.

A $\mathsf{Catch}\ \rho$ is a CPS expression that aborts the computation. That is, the
handler $(\lambda k.\ \mathcal{S}[\![s]\!]_\rho)$ discards the current continuation $k$. In the translation of
statement throw(e, i), we call the provided evidence i and then the handler 
e. Running the evidence lifts the handler into the correct region, making it 
compatible with the current answer type. It is safe for the handler to discard the 
continuation k, since all cleanup actions contained in k are run by the evidence. 


Example 5. Let us consider the example from Section 2.2. The example on the 
left translates to the resulting term of type $\mathsf{Cps}\ \mathsf{Void}\ \mathsf{Int}$ on the right.


try { [r1](e1 : Catch r1, l1 : r1 ⊑ ⊤) ⇒     (Λr1. λe1. λl1.
  safeDiv[r1](5, 0, e1)                         safeDiv r1 5 0 e1
} catch {                                     ) (Cps Void Int)
  return 0                                    (λk1. λk2. k2 0)
}                                             (Λa. λm. λk. λj. m (λx. k x j))
                                              (λx. λk. k x)


The resulting System F term can be beta reduced and eta expanded to:
λk2. safeDiv (Cps Void Int) 5 0 (λk1. λk2. k2 0) (λx. λk. k x) k2


We instantiate the answer type $r$ of safeDiv with $r_1$, which itself is instantiated
with $\mathsf{Cps}\ \mathsf{Void}\ \mathsf{Int}$. The return type is $\mathsf{Cps}\ (\mathsf{Cps}\ \mathsf{Void}\ \mathsf{Int})\ \mathsf{Int}$ and our program now
receives two continuations. To abort, the exception handler discards the first one
(i.e., $k_1$) and returns 0 to the second one (i.e., $k_2$).
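The following untyped Python sketch replays Example 5. The definition of `safe_div` is an assumption (the excerpt does not show the body of safeDiv; here it throws with reflexive evidence when the divisor is zero), and `try_catch` and `throw` are helper names introduced only for this illustration.

```python
def run_cps(m):
    # RunCps = λm. m (λx. λk. k x)
    return m(lambda x: lambda k: k(x))


def lift_cps(m):
    # LiftCps = Λa. λm. λk. λj. m (λx. k x j)
    return lambda k: lambda j: m(lambda x: k(x)(j))


def throw(handler, evidence):
    # S[throw(e, i)] = E[i] Void E[e]
    return evidence(handler)


def try_catch(body, handler_stmt):
    # The handler λk. S[s] discards the continuation k of the throw site.
    handler = lambda k: handler_stmt
    return run_cps(body(handler, lift_cps))


def safe_div(x, y, exc):
    # Hypothetical body: throw to exc on division by zero, else return x / y.
    if y == 0:
        return throw(exc, lambda m: m)      # reflexive evidence
    return lambda k: k(x // y)


def example5(x, y):
    # try { [r1](e1, l1) => safeDiv[r1](x, y, e1) } catch { return 0 }
    return try_catch(lambda e1, l1: safe_div(x, y, e1), lambda k2: k2(0))


aborted = example5(5, 0)(lambda v: v)
normal = example5(6, 3)(lambda v: v)
```

In the aborting run the inner continuation is discarded and 0 is passed to the outer one; in the normal run the initially empty inner continuation forwards the quotient to the outer continuation.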


4.4 Combining Resource Pools and Exceptions 


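Well-typed programs in $\Lambda_\rho$ translate to well-typed programs in System F.

Theorem 3 (Well-typedness of Translated Terms).


If $\Gamma \mid \rho \vdash s : \tau$, then $\mathcal{T}[\![\Gamma]\!] \vdash \mathcal{S}[\![s]\!]_\rho : (\mathcal{T}[\![\tau]\!] \to \mathcal{T}[\![\rho]\!]) \to \mathcal{T}[\![\rho]\!]$.


Proof.
Straightforward induction over the typing derivation.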
The translation of exception handlers in Section 4.3 automatically interacts cor-
rectly with the evidence terms we have defined for resource pools in Section 4.2: 
We clear a pool exactly when an exception is thrown across it. This is because 
we have chosen the translation of evidence to be a concrete computation that 
moves from one region to another one. 
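A small untyped Python sketch can illustrate this interaction. The logging primitives and the concrete program (open a file in a pool, then throw to a handler outside the pool) are assumptions made for this example; `run_pool`, `lift_pool`, and `run_cps` follow Figures 11 and 12 with types erased.

```python
import itertools

log = []
_fresh = itertools.count(1)


def create_pool():
    h = "h%d" % next(_fresh)
    log.append(("createPool", h))
    return h


def release_pool(h):
    log.append(("releasePool", h))


def open_file(h, name):
    log.append(("openFile", h, name))
    return (h, name)


def run_pool(m):
    def run(k):
        h = create_pool()

        def on_return(x):
            release_pool(h)
            return k(x)
        return m(h)(on_return)
    return run


def lift_pool(h):
    return lambda m: (lambda k: (release_pool(h), m(k))[1])


def run_cps(m):
    return m(lambda x: lambda k: k(x))


def program(k):
    # try { [r1](stop, l1) =>
    #   pool { [r2](p, l2) => val f = open(p, "input", 0); throw(stop, l2) }
    # } catch { return -1 }
    stop = lambda k1: lambda k2: k2(-1)

    def pool_body(p, l2):
        def stmt(k1):
            open_file(p, "input")
            return l2(stop)(k1)       # S[throw(stop, l2)] = E[l2] Void E[stop]
        return stmt

    body = run_pool(lambda h: pool_body(h, lift_pool(h)))
    return run_cps(body)(k)


result = program(lambda x: x)
```

The pool is released exactly once, by the evidence `l2`, and the release frame that `run_pool` pushed onto the continuation is discarded together with the rest of the inner continuation.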


Example 6. The following is an extended example where we combine resource 
pools and exceptions in a more complicated way. The program splits a large 
input file into smaller files of 100 lines each. 


try { [r1](stop : Catch r1, l1 : r1 ⊑ Top) ⇒
  withFile[r1]("input", { [r2](in : File r2, l2 : r2 ⊑ r1) ⇒
    def copyFile(target : String) at r2 {
      withFile[r2](target, { [r3](out : File r3, l3 : r3 ⊑ r2) ⇒
        def copyLine() at r3 {
          if (isEOF(in, l3)) { throw(stop, l3 ⊕ l2) }
          else { writeln(out, readln(in, l3), 0) }
        };
        def innerLoop(toCopy : Int) at r3 {
          if (toCopy > 0) { copyLine(); innerLoop(toCopy - 1) }
        };
        innerLoop(100)
      })
    };
    def loop(n : Int) at r2 { copyFile("output" ++ n); loop(n + 1) };
    loop(0)
  })
} catch { return () }


When we encounter the end of the input file, we simply throw an exception to 
terminate the program. We can be confident that all resources will be properly 
cleaned up and so fearlessly use exceptions to structure control flow. The outer
loop, for example, never returns. It is terminated by throwing an exception. This
program, after CPS translation, manually applying contification [17], and beta
reduction, reduces to the code in Figure 13.

Our CPS translation of both regions and control enables aggressive opti- 
mization. For example, at the end of the input file, we immediately release both 
pools and return. Since we only apply well-known optimizations on functional 
programs, we can be certain of their correctness without having to reason
explicitly about resources, control effects, or their combination. The overall
correctness of the optimized result rests on the correctness of our CPS transla- 
tion. 


4.5 Simulation of the Machine Semantics by the CPS translation 


In Section 3, we defined an operational semantics for $\Lambda_\rho$. In this section we
defined a CPS translation for $\Lambda_\rho$. We now show that the two behave the same.
This entails that the operational properties from Section 3 carry over to the 
λk.
  let p2 = createPool ();
  let in = openFile p2 "input";
  let rec loop n = (λk1.
    let p3 = createPool ();
    let out = openFile p3 ("output" ++ n);
    let rec innerLoop toCopy = (λk2.
      if (toCopy > 0)
      then if isEOF (in)
        then releasePool p3; releasePool p2; (λk3. k3 0)
        else let line = readLine in; writeLine out line; innerLoop (toCopy - 1)
      else releasePool p3; loop (n + 1)
    );
    innerLoop 100
  );
  loop 0 k


Fig. 13. Result of translating Example 6 to CPS. 


CPS translation and that optimization via beta reduction is sound. To show 
preservation of semantics, we extend our translation to machine states [4, 15]. We
translate statements to terms and stacks to evaluation contexts in System F. We
define the translation $\mathcal{M}[\![\cdot]\!]$ of machine states as the plugging of the translation
of the statement into the translation of the stack. The full translation is available 
in a separate technical report [27]. 

We show that for each step the machine takes, there is a corresponding 
(possibly empty) sequence of steps between the translated terms. 


Theorem 4 (Simulation).
If $M \to M'$, then $\mathcal{M}[\![M]\!] \to^{*} \mathcal{M}[\![M']\!]$.


Proof.
By considering each case of the stepping relation. The (throw) step needs its
own lemma, which we show by induction on possible evidence expressions.


Since for simulation we are only interested in operational behavior, we target 
the untyped lambda calculus (with primitives for file management) instead of 
System F. The translation of statements is the same as $\mathcal{S}[\![s]\!]_\rho$ in Figures 10, 11,
and 12, but we erase all type annotations, type abstractions, and type appli- 
cations. There is no harm in doing so, since our target is in CPS where the 
evaluation order is explicit. 

While the operational semantics given in Section 3 discards frames during 
unwinding, for our proof of simulation we have to retain them. We do so in a 
third component of the machine state $\langle \mathbf{throw}(h, w) \parallel K \parallel H \rangle$: the stack trace $H$.
This is necessary because the CPS translation discards the whole continuation 
in one step, while the operational semantics unwinds the stack frame-by-frame. 
We translate the empty stack to a special primitive function done, which
will return the overall result of the program. It is called exactly once, when the 
machine is in its final state and we return to the empty stack. 


Example 7. Pools are created and released exactly when they would be in the 
operational semantics. As an illustration, consider the following sequence of ma- 
chine steps where we unwind a pool frame: 


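$\langle \mathbf{throw}(h_1, (\mathsf{po}\,h_2 :: \bullet)) \parallel \#\mathsf{pool}_{h_2}\{\square\} :: \#\mathsf{catch}_{h_1}\{\square\}\{\mathbf{return}\ 1\} :: \bullet \rangle \to$

$\langle \mathbf{throw}(h_1, (\mathsf{po}\,h_2 :: \bullet)) \parallel \#\mathsf{pool}_{h_2}\{\square\} :: \#\mathsf{catch}_{h_1}\{\square\}\{\mathbf{return}\ 1\} :: \bullet \parallel \bullet \rangle \to$

$\langle \mathbf{throw}(h_1, \bullet) \parallel \#\mathsf{catch}_{h_1}\{\square\}\{\mathbf{return}\ 1\} :: \bullet \parallel \#\mathsf{pool}_{h_2}\{\square\} :: \bullet \rangle \to$

$\langle \mathbf{return}\ 1 \parallel \bullet \rangle$


The first step (throw) goes from normal execution to the unwinding state, which
accumulates frames in its third component. The next two steps are (free) and
(catch). In CPS, we can observe the same program trace:

$((\mathsf{LiftPool}\ h_2)\ (\lambda k_1.\ \lambda k_2.\ k_2\ 1))\ (\lambda x.\ \mathsf{releasePool}\ h_2;\ (\lambda x.\ \lambda k.\ k\ x)\ x)\ \mathsf{done} \to$
$(\lambda k.\ \mathsf{releasePool}\ h_2;\ (\lambda k_1.\ \lambda k_2.\ k_2\ 1)\ k)\ (\lambda x.\ \mathsf{releasePool}\ h_2;\ (\lambda x.\ \lambda k.\ k\ x)\ x)\ \mathsf{done} \to$
$(\lambda k_1.\ \lambda k_2.\ k_2\ 1)\ (\lambda x.\ \mathsf{releasePool}\ h_2;\ (\lambda x.\ \lambda k.\ k\ x)\ x)\ \mathsf{done} \to$
$(\lambda k_2.\ k_2\ 1)\ \mathsf{done}$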


Example 8. Although we do not have any markers generated at runtime, the CPS 
translation exactly mimics the behavior of the operational semantics, which does 
have them. Consider another example, where we throw an exception to an outer 
handler. The steps are (throw), (forward), and (catch). 


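$\langle \mathbf{throw}(h_1, (\mathsf{ca}\,h_2 :: \bullet)) \parallel \#\mathsf{catch}_{h_2}\{\square\}\{\mathbf{return}\ 2\} :: \#\mathsf{catch}_{h_1}\{\square\}\{\mathbf{return}\ 1\} :: \bullet \rangle \to$

$\langle \mathbf{throw}(h_1, (\mathsf{ca}\,h_2 :: \bullet)) \parallel \#\mathsf{catch}_{h_2}\{\square\}\{\mathbf{return}\ 2\} :: \#\mathsf{catch}_{h_1}\{\square\}\{\mathbf{return}\ 1\} :: \bullet \parallel \bullet \rangle \to$

$\langle \mathbf{throw}(h_1, \bullet) \parallel \#\mathsf{catch}_{h_1}\{\square\}\{\mathbf{return}\ 1\} :: \bullet \parallel \#\mathsf{catch}_{h_2}\{\square\}\{\mathbf{return}\ 2\} :: \bullet \rangle \to$

$\langle \mathbf{return}\ 1 \parallel \bullet \rangle$


In CPS, we start out with three continuations, then we push the first one onto
the second one, then the exception handler discards both in one step:

$(\mathsf{LiftCps}\ (\lambda k_1.\ \lambda k_2.\ k_2\ 1))\ (\lambda x.\ \lambda k.\ k\ x)\ (\lambda x.\ \lambda k.\ k\ x)\ \mathsf{done} \to$
$(\lambda k.\ \lambda j.\ (\lambda k_1.\ \lambda k_2.\ k_2\ 1)\ (\lambda y.\ k\ y\ j))\ (\lambda x.\ \lambda k.\ k\ x)\ (\lambda x.\ \lambda k.\ k\ x)\ \mathsf{done} \to$
$(\lambda k_1.\ \lambda k_2.\ k_2\ 1)\ (\lambda y.\ (\lambda x.\ \lambda k.\ k\ x)\ y\ (\lambda x.\ \lambda k.\ k\ x))\ \mathsf{done} \to$
$(\lambda k_2.\ k_2\ 1)\ \mathsf{done}$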


The CPS translation exhibits the same behavior as the operational semantics. It 
simulates the generative semantics of exceptions. Remarkably, it does not need 
any runtime support for markers on the stack to do so. Indeed, in CPS there is 
no stack! 
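Both CPS traces can be replayed directly in untyped Python. In this sketch `done` is modeled as a function that tags the final result, and `released` records calls to a stand-in `release_pool`; both are assumptions of the illustration rather than part of the formal development.

```python
released = []


def release_pool(h):
    released.append(h)


def lift_pool(h):
    # LiftPool h = Λa. λm. λk. releasePool h; m k
    def lift(m):
        def lifted(k):
            release_pool(h)
            return m(k)
        return lifted
    return lift


def lift_cps(m):
    # LiftCps = Λa. λm. λk. λj. m (λx. k x j)
    return lambda k: lambda j: m(lambda x: k(x)(j))


handler = lambda k1: lambda k2: k2(1)     # λk1. λk2. k2 1
ret = lambda x: lambda k: k(x)            # λx. λk. k x
done = lambda x: ("done", x)              # stand-in for the primitive done


def pool_frame(x):
    # λx. releasePool h2; (λx. λk. k x) x
    release_pool("h2")
    return ret(x)


# Example 7: the evidence releases h2, the handler discards the pool frame.
example7 = lift_pool("h2")(handler)(pool_frame)(done)

# Example 8: the handler discards both inner continuations in one step.
example8 = lift_cps(handler)(ret)(ret)(done)
```

In the first run the pool `h2` is released once, by the evidence, and the discarded pool frame never runs; in both runs the result 1 reaches `done`.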
5 Related Work


Out of the large body of work on regions, the work most closely related to ours,
and indeed the basis of our work, is that of Kiselyov and Shan
[19], which in turn is based on work of Fluet and Morrisett [12]. Kiselyov and
Shan provide a library for region-based resource management in Haskell. They 
demonstrate how types, regions, and subregioning evidence are inferred, which 
we do not discuss. They deal with builtin Haskell exceptions, but leave a formal 
proof to future work. We go further, and add exceptions as a language feature, 
and prove region- and exception safety. Moreover, we present a CPS translation 
of these features. 

Crary et al. [9] present a language with dynamic regions, where regions do not 
have to be nested, resource access is safe, but resource cleanup is not automatic 
but explicit. Their language is presented in CPS. Indeed, to quote Fluet et al. 
[13]: “Dynamic regions are not restricted to LIFO lifetimes and can be treated as 
first-class objects. They are particularly well suited for iterative computations, 
CPS-based computations, and event-based servers where lexical regions do not 
suffice.” We present a CPS translation of lexical regions where resources are 
automatically released, even when an exception is thrown. 

Clearly also related is the line of work on monadic encapsulation of state 
[20, 23, 28]. The most recent work in this line [31] presents a mechanized proof 
of a number of equivalences in the presence of encapsulated mutable state. We 
merely prove that references are not used outside of their region, but do so in 
the presence of exceptions. 

Kiselyov and Ishii [18] present a Haskell library for effect handlers based on a 
variant of the free monad in Haskell. Their library supports user-defined effects 
and handlers and they provide a range of pre-defined effects like exceptions, non- 
determinism, and state. They also discuss a region effect for safe and automatic 
allocation and release of resources, which correctly works in the presence of the 
exception effect. Other effects, like non-determinism, are explicitly ruled out by 
the type system when they would be used across a resource delimiter. They reify 
the structure of the program as a free monad and then write interpreters over 
this structure, whereas we translate programs to CPS. Moreover we provide a 
proof of region- and exception safety, which is out of scope of their work. 

Leijen [21] reports on an extension of the programming language Koka with 
support for resources and finalization. They support general effect handlers, 
while we merely discuss the special case of exceptions. Their approach requires 
sophisticated modification of the language runtime, whereas our approach can 
be explained as a translation to pure System F. They allow for more complex 
finalization patterns, where users explicitly run the finalizers of a resumption. 
This is to avoid running finalizers on linearly used resumptions, a problem that 
we completely side-step by only discussing exceptions. 

Ahman and Bauer [1] present an approach to resource management: Run- 
ners. They guarantee that cleanup actions are run exactly once. We offer the 
same guarantee. We present an operational semantics that relates resource man- 
agement to the stack and a translation of programs to CPS. Their denotational
semantics translates programs to essentially a free monad. 

Thielecke [29] compares different control constructs by their translation to 
double-barrelled CPS where functions receive exactly two continuations. In con- 
trast, under our iterated CPS translation functions can receive any number of 
continuations. Moreover, even in the case where we pass two continuations, there 
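is a difference. In double-barrelled CPS, translated terms have type


$([\![A]\!] \to \mathsf{Ans}) \to ([\![B]\!] \to \mathsf{Ans}) \to \mathsf{Ans}$


whereas under our iterated CPS translation such terms would have type


$([\![A]\!] \to ([\![B]\!] \to \mathsf{Ans}) \to \mathsf{Ans}) \to ([\![B]\!] \to \mathsf{Ans}) \to \mathsf{Ans}$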


Their work is neither concerned with resources nor multiple different exception 
handlers. 

Thielecke [30] studies the connection between control effects and continuation 
passing. His work introduces some of the ideas presented in this paper: regions 
are answer types, region polymorphism is answer-type polymorphism, and effect
masking introduces a fresh region to delimit the extent of control effects. We ex- 
pand upon his work in several ways: Instead of a single control operator call/cc, 
we consider a language with multiple layers of exceptions and resources. There- 
fore, on the type level, we have subregioning evidence between nested regions, 
and on the term level, we translate to iterated CPS. 

Our iterated CPS translation of exceptions is closely related to the one pre- 
sented by Schuster et al. [26]. However, they do not support effect-polymorphic 
functions. Our translation to System F is similar to the one for effect handlers 
sketched in Appendix B of Hillerström et al. [15].


6 Conclusion 


We presented $\Lambda_\rho$, a language with first-class functions, regions, resources, and
exceptions. Its type system guarantees safe access to resources and safe use of 
exceptions. We then presented a CPS translation that preserves these guarantees. 

We view regions as describing runtime stacks. This view is very much in line 
with recent work on effect handlers. One does wonder if our approach scales to 
more general control effects, which do not discard the current continuation, and 
perhaps even use it multiple times. This is the subject of ongoing investigation. 


Acknowledgments 


The work on this project was supported by the Deutsche Forschungsgemeinschaft 
(DFG — German Research Foundation) — project number DFG-448316946. 


Region-based Resource Management and Lexical Exception Handlers in CPS 517 


References 


[1] Ahman, D., Bauer, A.: Runners in action. In: Müller, P. (ed.) Programming
Languages and Systems, pp. 29-55, Springer International Publishing,
Cham (2020)

[2] Appel, A.W.: Compiling with Continuations. Cambridge University Press,
New York, NY, USA (1992), ISBN 0-521-41695-7

[3] Bertot, Y., Castéran, P.: Interactive Theorem Proving and Program Development,
Coq'Art: The Calculus of Inductive Constructions. Springer-Verlag
(2004)

[4] Biernacki, D., Piróg, M., Polesiuk, P., Sieczkowski, F.: Abstracting algebraic
effects. Proc. ACM Program. Lang. 3(POPL), 6:1-6:28 (Jan 2019)

[5] Biernacki, D., Piróg, M., Polesiuk, P., Sieczkowski, F.: Binders by day, labels
by night: Effect instances via lexically scoped handlers. Proc. ACM
Program. Lang. 4(POPL) (Dec 2019), https://doi.org/10.1145/3371116

[6] Brachthäuser, J.I., Schuster, P., Ostermann, K.: Effects as capabilities: Effect
handlers and lightweight effect polymorphism. Proc. ACM Program.
Lang. 4(OOPSLA) (Nov 2020), https://doi.org/10.1145/3428194

[7] Brady, E.: Idris 2: Quantitative type theory in action. Tech. rep., University
of St Andrews, Scotland, UK (2020), URL
https://www.type-driven.org.uk/edwinb/papers/idris2.pdf

[8] Cong, Y., Osvald, L., Essertel, G.M., Rompf, T.: Compiling with continuations,
or without? whatever. Proc. ACM Program. Lang. 3(ICFP), 79:1-79:28
(Jul 2019), https://doi.org/10.1145/3341643

[9] Crary, K., Walker, D., Morrisett, G.: Typed memory management in a calculus
of capabilities. In: Proceedings of the 26th ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, p. 262-275, POPL
'99, Association for Computing Machinery, New York, NY, USA (1999),
https://doi.org/10.1145/292540.292564

[10] Danvy, O.: On evaluation contexts, continuations, and the rest of computation
(02 2004)

[11] Danvy, O., Filinski, A.: Abstracting control. In: Proceedings of the Conference
on LISP and Functional Programming, pp. 151-160, ACM, New York,
NY, USA (1990)

[12] Fluet, M., Morrisett, G.: Monadic regions. In: Proceedings of the Ninth
ACM SIGPLAN International Conference on Functional Programming, p.
103-114, ICFP '04, Association for Computing Machinery, New York, NY,
USA (2004), https://doi.org/10.1145/1016850.1016867

[13] Fluet, M., Morrisett, G., Ahmed, A.: Linear regions are all you need. In:
Sestoft, P. (ed.) Programming Languages and Systems, pp. 7-21, Springer
Berlin Heidelberg, Berlin, Heidelberg (2006)

[14] Grossman, D., Morrisett, G., Jim, T., Hicks, M., Wang, Y., Cheney, J.:
Region-based memory management in Cyclone. In: Proceedings of the ACM
SIGPLAN 2002 Conference on Programming Language Design and Imple- 
mentation, p. 282-293, PLDI ’02, Association for Computing Machinery, 
New York, NY, USA (2002), https://doi.org/10.1145/512529.512563 


[15] Hillerström, D., Lindley, S., Atkey, R., Sivaramakrishnan, K.: Continuation
passing style for effect handlers. In: Formal Structures for Computation
and Deduction, LIPIcs, vol. 84, Schloss Dagstuhl-Leibniz-Zentrum für
Informatik (2017)

[16] Hillerström, D., Lindley, S., Atkey, R.: Effect handlers via generalised
continuations. Journal of Functional Programming 30, e5 (2020),
https://doi.org/10.1017/S0956796820000040

[17] Kennedy, A.: Compiling with continuations, continued. In: Proceedings of
the International Conference on Functional Programming, pp. 177-190,
ACM, New York, NY, USA (2007)

[18] Kiselyov, O., Ishii, H.: Freer monads, more extensible effects. In:
Proceedings of the Haskell Symposium, pp. 94-105, ACM, New York, NY, USA
(2015)

[19] Kiselyov, O., Shan, C.c.: Lightweight monadic regions. In: Proceedings of
the Haskell Symposium, Haskell '08, ACM, New York, NY, USA (2008)

[20] Launchbury, J., Peyton Jones, S.L.: Lazy functional state threads.
In: Proceedings of the ACM SIGPLAN 1994 Conference on Programming
Language Design and Implementation, p. 24-35, PLDI '94,
Association for Computing Machinery, New York, NY, USA (1994),
https://doi.org/10.1145/178243.178246

[21] Leijen, D.: Algebraic effect handlers with resources and deep finalization.
Tech. Rep. MSR-TR-2018-10, Microsoft Research (April 2018)

[22] Levy, P.B., Power, J., Thielecke, H.: Modelling environments in
call-by-value programming languages. Information and Computation 185(2),
182-210 (2003)

[23] Moggi, E., Sabry, A.: Monadic encapsulation of effects: a revised approach
(extended version). Journal of Functional Programming 11(6), 591-627
(Nov 2001)

[24] Reynolds, J.C.: Definitional interpreters for higher-order programming
languages. In: Proceedings of the ACM annual conference, pp. 717-740, ACM,
New York, NY, USA (1972)

[25] Schuster, P., Brachthäuser, J.I.: Typing, representing, and abstracting
control. In: Proceedings of the Workshop on Type-Driven
Development, pp. 14-24, ACM, New York, NY, USA (2018),
https://doi.org/10.1145/3240719.3241788

[26] Schuster, P., Brachthäuser, J.I., Ostermann, K.: Compiling effect handlers in
capability-passing style. Proc. ACM Program. Lang. 4(ICFP) (Aug 2020),
https://doi.org/10.1145/3408975

[27] Schuster, P., Brachthäuser, J.I., Ostermann, K.: Region-based resource
management and lexical exception handlers in continuation-passing style
(technical report). Tech. rep., University of Tübingen, Germany (01 2022),
https://se.informatik.uni-tuebingen.de/publications/schuster22region/

[28] Semmelroth, M., Sabry, A.: Monadic encapsulation in ml. In: Proceedings
of the Fourth ACM SIGPLAN International Conference on Functional
Programming, p. 8-17, ICFP '99, Association for Computing Machinery, New
York, NY, USA (1999), https://doi.org/10.1145/317636.317777




[29] Thielecke, H.: Comparing control constructs by double-barrelled
cps. Higher Order Symbol. Comput. 15(2-3), 141-160 (sep 2002),
https://doi.org/10.1023/A:1020887011500

[30] Thielecke, H.: From control effects to typed continuation passing.
In: Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, p. 139-149, POPL '03,
Association for Computing Machinery, New York, NY, USA (2003),
https://doi.org/10.1145/604131.604144

[31] Timany, A., Stefanesco, L., Krogh-Jespersen, M., Birkedal, L.: A logical
relation for monadic encapsulation of state: Proving contextual equivalences
in the presence of runst. Proc. ACM Program. Lang. 2(POPL) (Dec 2017),
https://doi.org/10.1145/3158152

[32] Tofte, M., Birkedal, L., Elsman, M., Hallenberg, N., Sestoft, P.:
Programming with regions in the ml kit (for version 4) (10 2001)

[33] Tofte, M., Talpin, J.P.: Region-based memory management. Inf. Comput.
132(2), 109-176 (Feb 1997), https://doi.org/10.1006/inco.1996.2613

[34] Xie, N., Brachthäuser, J.I., Hillerström, D., Schuster, P., Leijen, D.:
Effect handlers, evidently. Proc. ACM Program. Lang. 4(ICFP) (Aug 2020),
https://doi.org/10.1145/3408981

[35] Zhang, Y., Myers, A.C.: Abstraction-safe effect handlers via tunneling. Proc. 
ACM Program. Lang. 3(POPL), 5:1-5:29 (Jan 2019) 

[36] Zhang, Y., Salvaneschi, G., Beightol, Q., Liskov, B., Myers, A.C.: Accepting 
blame for safe tunneled exceptions. In: Proceedings of the Conference on 
Programming Language Design and Implementation, pp. 281-295, ACM, 
New York, NY, USA (2016) 


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 




A Predicate Transformer for Choreographies 


Computing Preconditions in Choreographic Programming 


Sung-Shik Jongmans¹,² and Petra van den Bos³


1 Department of Computer Science, Open University, Heerlen, the Netherlands 
2 CWI, Amsterdam, the Netherlands 
3 Formal Methods and Tools Group, University of Twente, Enschede, the Netherlands 


Abstract. Construction and analysis of distributed systems is difficult; 
choreographic programming is a deadlock-freedom-by-construction ap- 
proach to simplify it. In this paper, we present a new theory of chore- 
ographic programming. It supports for the first time: construction of 
distributed systems that require decentralised decision making (i.e., if/ 
while-statements with multiparty conditions); analysis of distributed sys- 
tems to provide not only deadlock freedom but also functional correctness 
(i.e., pre/postcondition reasoning). Both contributions are enabled by a 
single new technique, namely a predicate transformer for choreographies. 


1 Introduction 


Construction and analysis of distributed systems that consist of message passing 
processes is hard. Typical challenges include providing deadlock freedom (i.e., the 
processes never get stuck) and functional correctness (i.e., the processes com- 
pute the intended outcome). Choreographic programming [8,9,10] is a deadlock- 
freedom-by-construction approach to make implementation and verification of 
distributed systems easier. In this paper, to address two limitations of existing 
theories, we present a new theory of choreographic programming. It supports for 
the first time: construction of distributed systems that require decentralised 
decision making; analysis of distributed systems to provide not only deadlock 
freedom but also functional correctness. 


1.1 Background: Choreographic Programming by Example 


To explain choreographic programming, consider a distributed system in which 
two processes enact roles Client and Server. First, a username and password are 
communicated from Client to Server. Next, Server checks Client’s credentials and 
informs Client about the outcome: if authentication succeeded, the execution 
continues; if it failed, it ends. We construct and analyse this system as follows: 


1. Initially, we write a global program G (“the choreography”); it prescribes 
the behaviour of all roles, collectively, from their shared perspective. 


C."foo"→S.x ; C.123→S.y ; if S.auth(x,y) (S.SUCC→C ; G’) (S.FAIL→C)


[Figure 1, top to bottom: global program G (all construction/analysis activities
happen here, manual); projection; local programs L1, L2, ..., Ln (all
deployment/execution activities happen there, automatic)]

Fig. 1: Workflow of choreographic programming 


In this notation, p.e→q.y prescribes a value communication to share data
from role p to role q: expression e is evaluated at p, sent at p, received at q,
and stored in variable y at q. Similarly, p.ℓ→q prescribes a label communication
to share decisions: label ℓ is actively selected at p (“internal choice”),
sent at p, received at q, and passively branched on at q (“external choice”).
Furthermore, G1 ; G2 and if r.e G1 G2 prescribe a sequence and a conditional
choice (i.e., if e is evaluated to true at r, then G1 is executed, or else G2).
Now, informally, the first theorem of choreographic programming is this: 


Theorem 1 (Deadlock Freedom). Every global program is deadlock-free. 


2. Subsequently, we decompose global program G into local programs L_C
and L_S (“the processes”), using a projection function; every local program
prescribes the behaviour of one role, individually, from its own perspective.


Client: CS!"foo" ; CS!123 ; SC?{SUCC: L’_C, FAIL: skip}
Server: CS?x ; CS?y ; if S.auth(x,y) (SC!SUCC ; L’_S) (SC!FAIL)


In this notation, pq!e and pq?y prescribe a send and a receive of a value from
p to q. Similarly, pq!ℓ and pq?{ℓ_i : L_i}_{i∈I} prescribe a send and a receive of a
label (i.e., if ℓ_j is received for some j ∈ I, then L_j is executed).

Now, informally, the second theorem of choreographic programming is this: 


Theorem 2 (Operational Equivalence). Every well-formed global pro- 
gram is operationally equivalent to the parallel composition of its projections. 


“Well-formedness” is a syntactic condition on global programs; we discuss it 
in more detail later. Here, we just claim that G above is indeed well-formed. 


3. Finally, we compose local programs L_C and L_S in parallel (“the distributed
system”), by deploying them concurrently, and by executing them at their
own pace; as they run, L_C and L_S send and receive messages as prescribed.
Now, Thm. 1 and Thm. 2 together entail that L_C and L_S are deadlock-free,
by construction, without extra analysis. Figure 1 summarises the workflow.
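
As a rough illustration of step 3, the following Python sketch deploys hand-written counterparts of the two local programs as threads. The names run_system and auth, and the use of buffered queues as channels, are assumptions made for this sketch only; in particular, buffered queues merely approximate the communication model of the text, and the continuations after SUCC are left out.

```python
import queue
import threading


def run_system(username, password, auth):
    """Run the Client and Server local programs concurrently.

    Returns a dict that maps each role to the label it ended up with.
    """
    cs = queue.Queue()  # channel from Client to Server
    sc = queue.Queue()  # channel from Server to Client
    outcome = {}

    def client():
        cs.put(username)           # CS!"foo"
        cs.put(password)           # CS!123
        outcome["Client"] = sc.get()  # SC?{SUCC: ..., FAIL: skip}

    def server():
        x = cs.get()               # CS?x
        y = cs.get()               # CS?y
        label = "SUCC" if auth(x, y) else "FAIL"
        sc.put(label)              # SC!SUCC or SC!FAIL
        outcome["Server"] = label

    threads = [threading.Thread(target=client), threading.Thread(target=server)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return outcome


def check_credentials(x, y):
    # Hypothetical credential check used only for this example.
    return (x, y) == ("foo", 123)


accepted = run_system("foo", 123, check_credentials)
rejected = run_system("foo", 456, check_credentials)
```

Because every send in one thread is matched by a receive in the other, in the order prescribed by the global program, neither thread blocks forever; both roles end with the same label in each run.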


1.2 Related Work: State of the Art & Open Problems 


Early work on choreographic programming was presented by Carbone et al.
[8,9] (using binary session types [34]) and by Carbone and Montesi [10] (using
multiparty session types [35]); substantial progress has been made since. For
instance, Montesi and Yoshida developed a theory of compositional choreographic
programming that supports open distributed systems [42]; Carbone et al. stud- 
ied connections between choreographic programming and linear logic [11,12,7]; 
Dalla Preda et al. combined choreographic programming with dynamic adapta- 
tion [48,46,47]; Cruz-Filipe and Montesi developed a minimal Turing-complete 
language of global programs [16,19]; Cruz-Filipe et al. presented a technique to 
extract global programs from families of local programs (“choreography extrac- 
tion”) [14]; and recently, Giallorenzo et al. studied a correspondence between 
choreographic programming and multitier languages [29]. Other work on chore- 
ographic programming includes results on case studies [15], procedural abstrac- 
tions [18], asynchronous communication [17], polyadic communication [20,31], 
implementability [28], and formalisation/mechanisation in Coq [21,22]. Further- 
more, theoretical developments are supported in practice by several tools, in- 
cluding Chor [10], AIOCJ [48,47], and Choral [29]. 
However, all publications cited above have two limitations: 


1. Regarding the construction of distributed systems, existing work on chore- 
ographic programming supports only centralised decision making: every if/ 
while-statement in a global program has a one-party condition, evaluated 
at a single role. For instance, in the example above, the decision to con- 
tinue or end the execution is made by Server alone; Client is duly informed 
afterwards—with a label communication—as it needs to know how to pro- 
ceed, but the decision is really Server’s. 

However, in many distributed systems, it is impractical (i.e., unnecessary or 
unnatural), or even impossible, for a single role to make decisions. 

For instance, consider a distributed system in which two processes enact 
roles Player1 and Player2 to simulate a game of chess. The idea is that, 
at the end of every turn, a move is communicated from “active” Playeri to 
“passive” Playerj, after which a decision must be made: should Playerj take 
a next turn, or is the game over? The key point here is that every role has 
enough knowledge to check if the latest move is, in fact, the final one. So after 
every turn, every role can privately—without a label communication—decide 
to continue or end the execution; moreover, unanimity is guaranteed. It is, 
thus, unnecessary to additionally use a label communication to have one role 
explicitly inform the other one about how to proceed. Yet, all publications 
cited above force the usage of a label communication in this situation anyway. 


2. Regarding the analysis of distributed systems, existing work on choreo- 
graphic programming focusses on providing deadlock freedom. In contrast, 
providing functional correctness has not received due attention. This is sur- 
prising: given the sequential programming style in which global programs are 
expressed, it seems worthwhile to study how classical verification techniques 
for sequential code can be adapted to choreographic programming. 


Beyond choreographic programming, all other choreography-based approaches
that we know of are limited to centralised decision making, including
conversation protocols (e.g., [3,27]), multiparty session types (MPST) (e.g., [35,13,23,24]),
and MPST extensions to support value-based reasoning using assertions [5],
dependent types [51,25], and refinement types [52]. Furthermore, we note that
(elements of) deductive verification and session types were combined in Actris [32]
and ParTypes [41]. Actris supports reasoning about functional correctness
(using separation logic [44,36]), but only for binary sessions. In contrast, ParTypes
supports multiparty sessions, but it does not consider functional correctness.


Table 1: State of the art (e.g., [9,10,12,19,29,42,47]) vs. this paper

                     state of the art             this paper
construction
  decisions          centralised                  decentralised
  conditions         one-party                    multiparty
  syntax             if r.e G_then G_else         if ⋀{r.e_r}_{r∈R} G_then G_else
  example program    1. B.x2→A.y1 ;               1. B.x2→A.y1 ; A.x1→B.y2 ;
                     2. if A.x1==y1               2. if A.x1==y1 ∧ B.x2==y2
                     3.   (A.SUCC→B ; G_then)     3.   G_then
                     4.   (A.FAIL→B ; G_else)     4.   G_else
analysis             deadlock freedom             deadlock freedom &
                                                  functional correctness


1.3 Contributions of This Paper 
In this paper, we address the two limitations described in Sect. 1.2. 


1. Construction: We present a new theory of choreographic programming 
that supports decentralised decision making: every if/while-statement has a 
multiparty condition, evaluated at multiple roles. 


2. Analysis: The new theory ensures that if the precondition is true in the 
initial state of a global program, then after executing the global program, 
the postcondition is true in the final state. Similar to deadlock freedom, this 
form of functional correctness is conferred from the global program to the 
parallel composition of its projections, by operational equivalence. 


Table 1 summarises our contributions relative to the state of the art; it also 
shows a minimal example to illustrate the essential difference between centralised 
decision making and decentralised. With centralised decision making (left global 
program), first, only Bob shares x2 with Alice; next, only Alice compares it with 
x1 and shares the outcome with Bob. In contrast, with decentralised decision 
making (right global program), first, both Alice and Bob share their values; next, 
both Alice and Bob compare them, but they do not need to share the outcomes, 
as their unanimity is guaranteed. 


1.4 Key Challenge: How to Check If Unanimity Is Guaranteed? 


So far, we have seen two examples of decentralised decision making (i.e., Player1
and Player2 in Sect. 1.2; Alice and Bob in Sect. 1.3). In both examples, we noted
that “unanimity is guaranteed”; this is crucially important to provide deadlock
freedom. As a counterexample of what can go wrong in the absence of unanimity, 
suppose that Bob’s condition in Tab. 1 were x2==true (i.e., he ignores Alice’s 
value). In that case, unanimity is not guaranteed, so Alice and Bob can diverge: 
Alice privately decides to enter one branch, while Bob privately decides to enter 
the other branch. A deadlock subsequently ensues if, for instance, Alice needs 
to await a message from Bob in her branch, while Bob needs to await a message 
from Alice in his branch. 
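
The counterexample can be made concrete with a small Python sketch. Note that it checks unanimity by brute-force enumeration over boolean values, which is only a stand-in chosen for illustration and not the predicate-transformer technique of this paper; the function name unanimous and the dictionary encoding of each role's memory are likewise assumptions.

```python
from itertools import product


def unanimous(cond_alice, cond_bob, values=(False, True)):
    """Check that Alice and Bob always reach the same decision.

    The states enumerated are those reachable after the exchange in the
    right-hand program of Tab. 1: Alice holds x1 and a copy y1 of Bob's x2,
    while Bob holds x2 and a copy y2 of Alice's x1.
    """
    for x1, x2 in product(values, repeat=2):
        alice = {"x1": x1, "y1": x2}
        bob = {"x2": x2, "y2": x1}
        if cond_alice(alice) != cond_bob(bob):
            return False  # the roles would enter different branches
    return True


# Conditions from Tab. 1: both roles compare both values.
ok = unanimous(lambda a: a["x1"] == a["y1"], lambda b: b["x2"] == b["y2"])

# Counterexample from the text: Bob ignores Alice's value.
bad = unanimous(lambda a: a["x1"] == a["y1"], lambda b: b["x2"] is True)
```

Here ok is True, whereas bad is False: for x1 = x2 = false, Alice's condition holds but Bob's does not, so the two roles diverge.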

Thus, the key challenge to support decentralised decision making in chore- 
ographic programming is this: “How to check if unanimity is guaranteed?” The 
pivotal insight is that this question can be reduced to a seemingly unrelated one: 
“Given a global program and a postcondition, how to compute a precondition?” 
It was first answered for sequential code by Dijkstra in the 1970s [26], in terms 
of a predicate transformer to compute weakest preconditions. A crucial technical 
contribution of this paper is a non-trivial adaptation of Dijkstra’s seminal work, 
tailored for choreographic programming, to provide not only functional correct- 
ness (i.e., ensure the truth of the postcondition) but also deadlock freedom in 
the presence of decentralised decision making (i.e., ensure unanimity). 


1.5 Organisation of This Paper 


In Sect. 2, to further motivate this paper’s new theory, we present more examples 
of real(istic) distributed systems that require decentralised decision making. 

The new theory is presented in Sects. 3-7: in Sect. 3, we present some pre- 
liminaries; in Sect. 4, we present a base calculus of global programs, without if/ 
while-statements, but with a main theorem that covers both deadlock freedom 
and functional correctness; in Sect. 5 and Sect. 6, to support decentralised de- 
cision making, we extend the base calculus with if/while-statements; in Sect. 7, 
we present a calculus of local programs and projection. Thus, Sect. 4-6 cover 
the upper half of Fig. 1, while only Sect. 7 covers the bottom half. 

Appendices appear in the full version of this paper [39]. Detailed definitions, 
auxiliary lemmas, main theorems, and proofs appear in a technical report [40]. 


2 Motivating Examples 


To further motivate the usefulness and necessity of this paper’s new theory, 
in this section, we present examples of real(istic) distributed systems that re- 
quire decentralised decision making; see Appx. A [39] for additional examples. 
Throughout the section, we adopt a programmer’s perspective and present only 
global programs (i.e., all construction and analysis activities that a programmer 
carries out manually in the workflow, happen in the upper half of Fig. 1). 
Regarding the usefulness of the new theory, the following example shows that 
centralised decision making can be impractical (i.e., unnatural or unnecessary). 


Example 1 (Chess simulation). From Sect. 1.2, recall the distributed system in
which two processes enact roles Player1 and Player2 to simulate a game of chess.

(a) Centralised

1. P1.b:=board() ; P2.b:=board() ;
2. while P1.!done(b)
3.   (P1.CONTINUE→P2 ; G12 ;
4.    if P2.!done(b)
5.      (P2.CONTINUE→P1 ; G21)
6.      (P2.END→P1 ; skip)) ;
7. P1.END→P2

(b) Decentralised

1. P1.b:=board() ; P2.b:=board() ;
2. while P1.!done(b) ∧ P2.!done(b)
3.   (G12 ;
4.    if P1.!done(b) ∧ P2.!done(b)
5.      G21
6.      skip)


Fig. 2: Global programs for chess simulation (Exmp. 1) 


Figure 2 shows two global programs: one that uses centralised decision making
(at Player1 and Player2, in alternating order), and one that uses the new
theory’s decentralised decision making; both have auxiliary global programs G12
(Player1 is active, Player2 is passive; details omitted) and G21 (vice versa).

In Sect. 1.2, we argued for the usefulness of decentralised decision making in 
this example: the label communications in Fig. 2a are actually unnecessary. 


Regarding the necessity of the new theory, the following example shows that
centralised decision making can be impossible. In the example, notation G1 || G2
prescribes an interleaving; it is used to express that the order in which G1 and G2
are executed does not matter (i.e., it is not intended to be multi-threading; there
is no interaction between G1 and G2). By convention, sequencing binds stronger
than interleaving. For instance, G1 ; G2 || G3 should be read as (G1 ; G2) || G3.


Example 2 (Probabilistic leader election in anonymous clique networks). Con- 
sider a distributed system in which k anonymous processes (i.e., they have no 
predefined identifiers) need to elect a leader among them. For clique networks 
(i.e., each process has a channel to each other process), a probabilistic version 
of Peleg’s algorithm [45] can be used in the style of Itai and Rodeh [37,38]. The 
algorithm proceeds in rounds. In every round, every process picks a random iden- 
tifier and sends it to every other process. If there is a unique maximal identifier, 
then the process that picked it becomes the leader. If not, another round follows. 

Figure 3 shows a global program for k=3; it crucially relies on the new
theory’s decentralised decision making. We write r.[x1,...,xn]:=[e1,...,en] to
abbreviate r.x1:=e1 ; ··· ; r.xn:=en, while we write p.e→[q1.x1,...,qn.xn] to
abbreviate p.e→q1.x1 ; ··· ; p.e→qn.xn. First, the processes initialise five variables
(lines 1-3): seed is used to pick random identifiers; id1, id2, and id3 are used to
store and compare identifiers; leader indicates whether or not the process was 
elected. Next, the processes enter the loop (lines 4-7), each of whose iterations 
represents one round: in every iteration, every process increments its seed, picks 
a random identifier, and shares it. When the maximal identifier is unique, the 
processes exit the loop. One process marks itself as leader (lines 8-10). 

 1. (P1.[seed, id1, id2, id3, leader] := [-1, -1, -1, -1, false] ||
 2.  P2.[seed, id1, id2, id3, leader] := [-1, -1, -1, -1, false] ||
 3.  P3.[seed, id1, id2, id3, leader] := [-1, -1, -1, -1, false]) ;
 4. while ⋀{r.!maxIsUnique(id1,id2,id3)}_{r∈{P1,P2,P3}}
 5.   (P1.seed:=seed+1 ; P1.id1:=random1(seed) ; P1.id1→[P3.id1, P2.id1] ||
 6.    P2.seed:=seed+1 ; P2.id2:=random2(seed) ; P2.id2→[P1.id2, P3.id2] ||
 7.    P3.seed:=seed+1 ; P3.id3:=random3(seed) ; P3.id3→[P2.id3, P1.id3]) ;
 8. if ⋀{r.id1 == max(id1,id2,id3)}_{r∈{P1,P2,P3}} (P1.leader := true) (skip) ;
 9. if ⋀{r.id2 == max(id1,id2,id3)}_{r∈{P1,P2,P3}} (P2.leader := true) (skip) ;
10. if ⋀{r.id3 == max(id1,id2,id3)}_{r∈{P1,P2,P3}} (P3.leader := true) (skip)

Fig. 3: Global program for probabilistic leader election in anonymous clique
networks (k=3), using decentralised decision making

The point of this example is that the probabilistic version of Peleg’s algorithm
for cliques (actually, any leader election algorithm) cannot faithfully be
implemented using centralised decision making. The reason is that centralised decision
making inherently requires the presence of a distinguished process (to evaluate
a one-party condition and share the outcome). However, the motivation to run
a leader election algorithm in the first place is that such a distinguished process
is not yet agreed upon. That is, centralised decision making requires asymmetry
of processes, whereas leader election algorithms require symmetry.
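
To make the round structure of Example 2 tangible, here is a minimal Python simulation. It is a sketch under stated assumptions: the function elect_leader, its parameters, and the use of a single shared pseudo-random generator (instead of one seed and one random function per process, as in Fig. 3) are choices made for illustration only.

```python
import random


def max_is_unique(ids):
    """Return True iff exactly one process picked the maximal identifier."""
    return ids.count(max(ids)) == 1


def elect_leader(k=3, id_range=4, seed=0, max_rounds=1000):
    """Simulate rounds until a unique maximal identifier is picked.

    Returns (index of the leader, number of rounds used).
    """
    rng = random.Random(seed)
    for round_no in range(1, max_rounds + 1):
        # Every process picks a random identifier ...
        ids = [rng.randrange(id_range) for _ in range(k)]
        # ... and shares it, so every process holds its own copy of all identifiers.
        views = [list(ids) for _ in range(k)]
        # Every process evaluates the loop condition privately on its own copy.
        decisions = [max_is_unique(view) for view in views]
        # Unanimity: all copies are equal, so all private decisions agree.
        assert len(set(decisions)) == 1
        if decisions[0]:
            return ids.index(max(ids)), round_no
    raise RuntimeError("no leader elected within max_rounds")


leader, rounds = elect_leader()
```

No process plays a distinguished role in the loop: the decision to stop is taken by all processes on identical local data, which is exactly what a one-party condition cannot express.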


3 Setting the Stage: Data and Conditions 


The topic of interest in this paper is “processes that communicate”, rather than 
“data that are communicated”. For this reason, we assume that there exists some 
underlying calculus of data (Sect. 3.1), but we omit most of its details; they are 
orthogonal to this paper’s contributions. On top of it, we adopt a logic to write 
preconditions, postconditions, and conditions in if/while-statements (Sect. 3.2). 


3.1 Data 


Let ℝ = {A, B, C, ...} denote a universe of roles, ranged over by p, q, r. Let
𝕏 = {x, y, z, ...} denote a universe of variables, ranged over by x, y, z. Let 𝕍 =
{error, true, false, 0, 1, 2, ...} denote a universe of values, ranged over by v
(i.e., 𝕍 contains at least a distinguished value error, booleans, and numbers,
but we also use other data types in examples, including functions). Let 𝔼 denote a
universe of expressions, ranged over by e; it is induced by the following grammar:


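\[
e ::= \underbrace{r.x}_{\text{role-qualified variable}} \mid v \mid
\underbrace{e_1 \mathtt{==} e_2 \mid e_1 \mathtt{<} e_2 \mid e_1 \mathtt{\&\&} e_2 \mid \mathtt{!}e \mid e_1 \mathtt{+} e_2}_{\text{compound expressions}}
\]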


Let 𝕊 = ℝ ⇀ (𝕏 ⇀ 𝕍) denote a universe of states (i.e., partial functions from
roles to partial functions from variables to values), ranged over by S; the idea is
that every state has a separate section for every role of interest, to model disjoint
memory spaces. Let eval : 𝕊 × 𝔼 → 𝕍 denote a total evaluation function. For
instance, eval_{{A↦{x↦5, y↦6}}}(A.x+A.y) = 11. We assume that bogus expressions
are evaluated to error. For instance, eval_S(1+true) = error.

Regarding terminology, we say that every role-qualified variable r.x is “local
to r”. If every role-qualified variable that occurs in e is local to r, then e is “local
to r”. Regarding notation, if e is local to r, then we often move all “r.”-qualifiers
that occur in e to the front. For instance, we write A.x+y instead of A.x+A.y.
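
The following Python sketch illustrates states and a total evaluation function. The tuple encoding of expressions, the name eval_expr, and the representation of the value error as a string are assumptions of this sketch; the underlying calculus of data is deliberately left open in the text.

```python
ERROR = "error"


def is_num(v):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return type(v) is int


def eval_expr(state, expr):
    """Evaluate expr in state; bogus expressions evaluate to ERROR.

    state: dict from roles to dicts from variables to values.
    expr:  ("val", v), ("var", role, name), ("!", e), or (op, e1, e2)
           with op one of "==", "<", "&&", "+".
    """
    tag = expr[0]
    if tag == "val":
        return expr[1]
    if tag == "var":
        return state.get(expr[1], {}).get(expr[2], ERROR)
    if tag == "!":
        v = eval_expr(state, expr[1])
        return (not v) if type(v) is bool else ERROR
    if tag not in ("==", "<", "&&", "+"):
        return ERROR
    left = eval_expr(state, expr[1])
    right = eval_expr(state, expr[2])
    if tag == "==":
        same_type = type(left) is type(right)
        return (left == right) if same_type and left != ERROR else ERROR
    if tag == "&&":
        both_bool = type(left) is bool and type(right) is bool
        return (left and right) if both_bool else ERROR
    if not (is_num(left) and is_num(right)):
        return ERROR
    return left + right if tag == "+" else left < right


state = {"A": {"x": 5, "y": 6}}
total = eval_expr(state, ("+", ("var", "A", "x"), ("var", "A", "y")))
bogus = eval_expr({}, ("+", ("val", 1), ("val", True)))
```

With the state from the text, total is 11 and bogus is the error value; an unbound variable also yields the error value, which keeps the function total.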


3.2 Conditions 


We adopt the following basic logic over expressions in 𝔼. Let Ψ denote a universe
of formulas, ranged over by φ, χ, ψ; it is induced by the following grammar:

\[
\varphi, \chi, \psi ::= e \mid \neg\psi \mid \psi_1 \wedge \psi_2 \mid \forall\psi
\]

Informally, given state S, formulas have the following meaning relative to S:


— Formula e is an atom: it is true in S iff e evaluates to true using S.

— Formulas ¬ψ and ψ1 ∧ ψ2 are a negation and a conjunction, as usual.
(Negation and conjunction appear also at the level of formulas, and not just
at the level of expressions, for technical convenience later on in this paper.)

— Formula ∀ψ is a tautology: it is true in S iff ψ is true in every state.


Formally, an interpretation function maps formulas to the sets of states in which
they are true, denoted by ⟦·⟧; it is induced by the following equations:

\[
\begin{aligned}
[\![ e ]\!] &= \{ S \mid \mathrm{eval}_S(e) = \mathtt{true} \} &
[\![ \neg\psi ]\!] &= \mathbb{S} \setminus [\![ \psi ]\!] \\
[\![ \psi_1 \wedge \psi_2 ]\!] &= [\![ \psi_1 ]\!] \cap [\![ \psi_2 ]\!] &
[\![ \forall\psi ]\!] &=
  \begin{cases}
    \mathbb{S} & \text{if } [\![ \psi ]\!] = \mathbb{S} \\
    \emptyset & \text{otherwise}
  \end{cases}
\end{aligned}
\]

Regarding terminology, if every expression that occurs in ψ is local to r, then
ψ is “local to r”; if so, the truth of ψ can be checked at r. Regarding notation, we
often write ⋀{ψ_r}_{r∈{r1,...,rn}} instead of ψ_{r1} ∧ ··· ∧ ψ_{rn} if ψ_r is local to r for every
r ∈ {r1,...,rn}. Furthermore, we write ψ1 ∨ ψ2 and ψ1 → ψ2 for disjunction
and implication. Finally, we write ψ1 ≡ ψ2 instead of ⟦ψ1⟧ = ⟦ψ2⟧.
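
The interpretation function can be prototyped over a finite universe of states, as in the Python sketch below. Restricting the universe to a finite list, representing atoms as Python predicates on states, and returning sets of indices into the universe are all simplifying assumptions of this sketch.

```python
def interp(universe, formula):
    """Return the set of indices of states in universe where formula is true.

    formula: ("atom", predicate), ("not", f), ("and", f1, f2), or ("forall", f).
    """
    everything = set(range(len(universe)))
    tag = formula[0]
    if tag == "atom":
        # An atom is true in S iff it evaluates to exactly True using S.
        return {i for i, s in enumerate(universe) if formula[1](s) is True}
    if tag == "not":
        return everything - interp(universe, formula[1])
    if tag == "and":
        return interp(universe, formula[1]) & interp(universe, formula[2])
    if tag == "forall":
        # True everywhere if the body is true in every state, else nowhere.
        body = interp(universe, formula[1])
        return everything if body == everything else set()
    raise ValueError(f"unknown formula tag: {tag!r}")


# Three states in which A.x is 0, 1, and 2, respectively.
universe = [{"A": {"x": v}} for v in (0, 1, 2)]
positive = ("atom", lambda s: s["A"]["x"] > 0)
non_negative = ("atom", lambda s: s["A"]["x"] >= 0)

where_positive = interp(universe, positive)
taut_positive = interp(universe, ("forall", positive))
taut_non_negative = interp(universe, ("forall", non_negative))
```

In this universe, positive holds only in the last two states, so its tautology formula is true nowhere, whereas the tautology formula of non_negative is true in all three states.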


4 Global Programs: Base Calculus 


To gently introduce the main components of the new theory, in this section, we 
present a base calculus of global programs, without if/while statements, but with 
a main theorem that covers both deadlock freedom and functional correctness. 

Initially, we present the syntax and semantics (Sect. 4.1); subsequently, we 
present a predicate transformer (Sect. 4.2); finally, we present the main theorem, 
which relies on the predicate transformer (Sect. 4.3). In the next sections, we 
extend the base calculus to support decentralised decision making. 


4.1 Syntax and Semantics 


Let Γ and 𝔾 denote universes of global actions and global programs, ranged over
by γ and G; they are induced by the following grammar:

\[
\gamma ::= q.y \mathtt{:=} e \mid p.e \to q.y
\qquad
G ::= \mathtt{skip} \mid \gamma \mid G_1 \mathbin{;} G_2 \mid G_1 \parallel G_2
\]
Informally, these grammar elements have the following meaning: 


— Global action q.y:=e models an assignment of the value of expression e to
variable y at role q. As an extra constraint, e is local to q. Regarding notation,
we often omit “q.”-qualifiers from e. For instance, we write A.z:=x+y instead
of A.z:=A.x+A.y. Also, we write eval_S(q.y:=e) instead of q.y:=eval_S(e).


— Global action p.e→q.y models a synchronous communication of the value
of expression e at role p into variable y at role q. As extra constraints, e is
local to p, and p ≠ q. Regarding notation, we often omit “p.”-qualifiers from
e. Also, we write eval_S(p.e→q.y) instead of p.eval_S(e)→q.y.


— Global program skip prescribes an empty execution. 


— Global program G1 ; G2 prescribes a weak sequence of G1 and G2. The idea
is that it resembles a conventional strong sequence (i.e., in-order execution),
except that it also allows global actions in G2 that are independent of those
in G1 to be executed already before G1 is done (i.e., out-of-order).

For instance, in A.x:=5;B.y:=6, the assignment at Bob is independent of 
the assignment at Alice, so they may be executed out-of-order. In contrast, 
in A.x:=5;A.x+1→B.y, the communication from Alice to Bob depends on 
the assignment at Alice, so they must be executed in-order. In general, when 
two global actions have disjoint subjects (i.e., participating roles), they are 
considered independent and may be executed out-of-order. 

Out-of-order execution of global actions with disjoint subjects is common in 
choreographic programming: it was first introduced by Carbone and Montesi 
to deal with latent concurrency among roles in global action sequences [10]. 


— Global program G1 ∥ G2 prescribes an interleaving of G1 and G2.


Formally, we define the operational semantics of global programs at two “layers”. 
(1) The “top layer” consists of an abstract termination relation, denoted
by ↓, and an abstract labelled reduction relation, denoted by →, in the style of
process algebra (e.g., [2]). More precisely, G↓ means that G can terminate, while
G –ψ,γ→ G′ means that G can reduce to G′ when ψ is true (i.e., conditionally) by
executing γ. For instance, the following abstract execution is possible:

\[ A.x{:=}5 \,;\, A.x{+}1{\to}B.y \;\xrightarrow{\mathsf{true},\;A.x:=5}\; \mathsf{skip} \,;\, A.x{+}1{\to}B.y \;\xrightarrow{\mathsf{true},\;A.x+1\to B.y}\; \mathsf{skip} \,;\, \mathsf{skip} \;\downarrow \]


First, the global program reduces by executing an assignment; next, it reduces 
by executing a communication; next, it terminates. For simplicity, skips are not 
automatically cleaned up by the reduction rules (but they could be). 

Relations ↓ and → are induced by the rules in Fig. 4a. Most rules are standard
[2]. Notably, in this section, every reduction is unconditional (i.e., labelled with
true) due to rule [→-Act]. The only special rule is rule [→-Seq2]: it states that
if G2 can reduce to G2′ by executing γ (right premise), and if γ is independent
of G1 (left premise), then G1 ; G2 can reduce accordingly (conclusion). We note
that independence is defined in terms of disjointness of subjects, as explained
above. For instance, the following abstract out-of-order execution is possible:
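\[
\frac{}{\mathsf{skip}\downarrow}\;[\downarrow\text{-Skip}] \qquad
\frac{G_1\downarrow \quad G_2\downarrow}{G_1 ; G_2\downarrow}\;[\downarrow\text{-Seq}] \qquad
\frac{G_1\downarrow \quad G_2\downarrow}{G_1 \parallel G_2\downarrow}\;[\downarrow\text{-Par}] \qquad
\frac{\psi = \mathsf{true}}{\gamma \xrightarrow{\psi,\gamma} \mathsf{skip}}\;[\to\text{-Act}]
\]
\[
\frac{G_1 \xrightarrow{\psi,\gamma} G_1'}{G_1 ; G_2 \xrightarrow{\psi,\gamma} G_1' ; G_2}\;[\to\text{-Seq1}] \qquad
\frac{\mathrm{subj}(G_1) \cap \mathrm{subj}(\gamma) = \emptyset \quad G_2 \xrightarrow{\psi,\gamma} G_2'}{G_1 ; G_2 \xrightarrow{\psi,\gamma} G_1 ; G_2'}\;[\to\text{-Seq2}]
\]
\[
\frac{G_1 \xrightarrow{\psi,\gamma} G_1'}{G_1 \parallel G_2 \xrightarrow{\psi,\gamma} G_1' \parallel G_2}\;[\to\text{-Par1}] \qquad
\frac{G_2 \xrightarrow{\psi,\gamma} G_2'}{G_1 \parallel G_2 \xrightarrow{\psi,\gamma} G_1 \parallel G_2'}\;[\to\text{-Par2}]
\]

(a) Base calculus

\[
\frac{\psi = \bigwedge\{e_r\}_{r\in R} \quad \gamma = 1^R}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2 \xrightarrow{\psi,\gamma} G_1}\;[\to\text{-If1}] \qquad
\frac{\psi = \bigwedge\{\neg e_r\}_{r\in R} \quad \gamma = 2^R}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2 \xrightarrow{\psi,\gamma} G_2}\;[\to\text{-If2}]
\]
\[
\frac{\psi = \bigwedge\{e_r\}_{r\in R} \quad \gamma = 1^R}{\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G \xrightarrow{\psi,\gamma} G ; \mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G}\;[\to\text{-While1}]
\]
\[
\frac{\psi = \bigwedge\{\neg e_r\}_{r\in R} \quad \gamma = 2^R}{\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G \xrightarrow{\psi,\gamma} \mathsf{skip}}\;[\to\text{-While2}]
\]

(b) Extension with if/while-statements (explained in Sect. 5)

\[
\frac{R = R_1 \cup R_2 \quad R_1 \neq \emptyset \text{ implies } G_1\downarrow \quad R_2 \neq \emptyset \text{ implies } G_2\downarrow}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2}\ \downarrow}\;[\downarrow\text{-NIf}]
\]
\[
\frac{r \in R \setminus (R_1 \cup R_2) \quad \psi = e_r \quad \gamma = 1^{\{r\}}}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1\cup\{r\}}\ G_2|_{R_2}}\;[\to\text{-NIf1}]
\]
\[
\frac{r \in R \setminus (R_1 \cup R_2) \quad \psi = \neg e_r \quad \gamma = 2^{\{r\}}}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2\cup\{r\}}}\;[\to\text{-NIf2}]
\]
\[
\frac{G_1 \xrightarrow{\psi,\gamma} G_1' \quad \mathrm{subj}(\gamma) \subseteq R_1 \setminus R_2}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1'|_{R_1}\ G_2|_{R_2}}\;[\to\text{-NIf3}]
\]
\[
\frac{G_2 \xrightarrow{\psi,\gamma} G_2' \quad \mathrm{subj}(\gamma) \subseteq R_2 \setminus R_1}{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2'|_{R_2}}\;[\to\text{-NIf4}]
\]
\[
\frac{\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ (G ; \mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset})|_{\emptyset}\ \mathsf{skip}|_{\emptyset} \xrightarrow{\psi,\gamma} G'}{\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset} \xrightarrow{\psi,\gamma} G'}\;[\to\text{-NWhile}]
\]

(c) Extension with non-blocking if/while-statements (explained in Sect. 6)


Fig. 4: Abstract operational semantics of global programs (“top layer”)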




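\[
\frac{G\downarrow}{(G,S)\downarrow}\;[\downarrow] \qquad
\frac{G \xrightarrow{\psi,\gamma} G' \quad S \in [\![\psi]\!] \quad \gamma^c = \mathrm{evals}(\gamma)}{(G,S) \xrightarrow{\gamma^c} (G', \mathrm{effect}(\gamma^c, S))}\;[\to]
\]
\[ \mathrm{effect}(q.y{:=}v, S) = S[v/q.y] \qquad \mathrm{effect}(p.v{\to}q.y, S) = S[v/q.y] \]
\[ S[v/q.y] = \{r \mapsto S(r) \mid q \neq r\} \cup \{q \mapsto \{x \mapsto S(q)(x) \mid x \neq y\} \cup \{y \mapsto v\}\} \]


Fig. 5: Concrete operational semantics of global programs (“bottom layer”)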


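\[ A.x{:=}5 \,;\, B.y{:=}6 \;\xrightarrow{\mathsf{true},\;B.y:=6}\; A.x{:=}5 \,;\, \mathsf{skip} \;\xrightarrow{\mathsf{true},\;A.x:=5}\; \mathsf{skip} \,;\, \mathsf{skip} \;\downarrow \]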
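
The rules of the base calculus in Fig. 4a can be sketched in executable form. The following is a minimal illustration, not the authors' implementation: it assumes a tuple encoding of global programs (a choice made here for brevity), keeps expressions as opaque strings, and omits conditions because every reduction in the base calculus is labelled with true. The function names `subj`, `terminates`, and `steps` are hypothetical.

```python
# Tuple encoding (an assumption of this sketch, not the paper's notation):
#   ("skip",)  ("assign", q, y, e)  ("comm", p, e, q, y)
#   ("seq", G1, G2)  ("par", G1, G2)

def subj(g):
    """Subjects (participating roles) of a global action or program."""
    tag = g[0]
    if tag == "assign":
        return {g[1]}
    if tag == "comm":
        return {g[1], g[3]}
    if tag in ("seq", "par"):
        return subj(g[1]) | subj(g[2])
    return set()  # skip has no subjects

def terminates(g):
    """Abstract termination: skip, or a composition of terminating parts."""
    tag = g[0]
    if tag == "skip":
        return True
    if tag in ("seq", "par"):
        return terminates(g[1]) and terminates(g[2])
    return False

def steps(g):
    """Yield (action, successor) pairs allowed by the base-calculus rules."""
    tag = g[0]
    if tag in ("assign", "comm"):
        yield g, ("skip",)
    elif tag == "seq":
        for act, left in steps(g[1]):        # in-order step in G1
            yield act, ("seq", left, g[2])
        for act, right in steps(g[2]):       # out-of-order step in G2
            if not (subj(g[1]) & subj(act)):
                yield act, ("seq", g[1], right)
    elif tag == "par":
        for act, left in steps(g[1]):
            yield act, ("par", left, g[2])
        for act, right in steps(g[2]):
            yield act, ("par", g[1], right)

a = ("assign", "A", "x", "5")
b = ("assign", "B", "y", "6")
c = ("comm", "A", "A.x+1", "B", "y")
independent = [act for act, _ in steps(("seq", a, b))]  # both a and b enabled
dependent = [act for act, _ in steps(("seq", a, c))]    # only a enabled
```

In this sketch, the second loop of the "seq" case plays the role of rule [→-Seq2]: a step of the right operand is allowed only if its subjects are disjoint from those of the left operand.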


(2) The “bottom layer” consists of a concrete termination predicate, denoted
by ↓ (same symbol as before), and a concrete labelled reduction relation, denoted
by → (ditto). The idea is that the bottom layer enriches the top layer by taking
into account states, in terms of configurations of the form (G, S). More precisely,
(G, S)↓ means that G can terminate in S, while (G, S) –γᶜ→ (G′, S′) means that
G can reduce to G′ by executing γᶜ in S to obtain S′. We write γᶜ, with a
superscript “c”, to indicate that it is a “concrete” global action in which every
expression has been evaluated to a value (using S). For instance, the following
concrete execution is possible:

\[
\begin{aligned}
 & (A.x{:=}5 \,;\, A.x{+}1{\to}B.y,\ \{A \mapsto \{x \mapsto 0\}, B \mapsto \{y \mapsto 0\}\}) \\
 \xrightarrow{A.x:=5}\ & (\mathsf{skip} \,;\, A.x{+}1{\to}B.y,\ \{A \mapsto \{x \mapsto 5\}, B \mapsto \{y \mapsto 0\}\}) \\
 \xrightarrow{A.6\to B.y}\ & (\mathsf{skip} \,;\, \mathsf{skip},\ \{A \mapsto \{x \mapsto 5\}, B \mapsto \{y \mapsto 6\}\}) \;\downarrow
\end{aligned}
\]


Relations ↓ and → are induced by the rules in Fig. 5. Rule [↓] states that if G
can terminate, then so can (G, S), regardless of S. More interestingly, rule [→]
states that if G can reduce to G′ when ψ is true by executing γ (left premise),
and if ψ is indeed true in S (middle premise), and if γᶜ is the “concretisation” of γ
such that every expression is first evaluated using S (right premise), then (G, S)
can reduce accordingly, and the effect of γᶜ is applied to S (conclusion); the latter
means that a variable is bound to a new value in S, formalised using “substitution
notation”. For instance (cf. second reduction in the concrete execution above),
if S = {A ↦ {x ↦ 5}, B ↦ {y ↦ 0}}, then effect(evals(A.x+1→B.y), S) =
effect(A.6→B.y, S) = {A ↦ {x ↦ 5}, B ↦ {y ↦ 6}}.
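
The state update of the bottom layer can be illustrated with a small sketch. It assumes (for illustration only) that a state is a nested dictionary from roles to local stores and that an expression is a Python function of the local store of the role that evaluates it; the helper names `substitute`, `step_assign`, and `step_comm` are hypothetical.

```python
def substitute(state, role, var, value):
    """S[v/q.y]: a copy of state in which var at role is bound to value."""
    new_state = {r: dict(store) for r, store in state.items()}
    new_state[role][var] = value
    return new_state

def step_assign(state, q, y, expr):
    """Concretise and execute q.y:=e; returns (concrete action, new state)."""
    v = expr(state[q])                 # e is local to q
    return (q, y, v), substitute(state, q, y, v)

def step_comm(state, p, expr, q, y):
    """Concretise and execute p.e -> q.y; returns (concrete action, new state)."""
    v = expr(state[p])                 # e is local to p
    return (p, v, q, y), substitute(state, q, y, v)

# The concrete execution of A.x:=5 ; A.x+1 -> B.y from the text.
s0 = {"A": {"x": 0}, "B": {"y": 0}}
act1, s1 = step_assign(s0, "A", "x", lambda store: 5)
act2, s2 = step_comm(s1, "A", lambda store: store["x"] + 1, "B", "y")
```

Because `substitute` copies the state, earlier states such as `s0` remain available, which mirrors the way configurations are related by reductions rather than mutated.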

Our formalisation of the operational semantics has two novelties: 


— Two-layered approach — In existing work on stateful choreographic program- 
ming (e.g., [14,19]), abstract and concrete operational semantics are merged 
into one. An advantage of keeping them separate is that it enables us to 
prove the main theorems also in a layered fashion; this simplifies our proofs. 


— Semantic reordering — In existing work on choreographic programming (e.g., 
[10,42]), weak sequencing is formalised using a structural congruence rela- 
tion in the style of pi-calculus (e.g., [50]), including special “swap rules” to 
syntactically reorder independent global actions. In contrast, rule [→-Seq2]
semantically reorders them; this simplifies our proofs. Our approach, inspired 
by Rensink and Wehrheim [49], essentially generalises the formalisation of 
asynchronous action prefixing in multiparty session types [24]. 
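\[
\frac{R \neq \emptyset}{\checkmark_R(\mathsf{skip})}\;[\checkmark\text{-Skip}] \qquad
\frac{q \in R}{\checkmark_R(q.y{:=}e)}\;[\checkmark\text{-Act1}] \qquad
\frac{p, q \in R}{\checkmark_R(p.e{\to}q.y)}\;[\checkmark\text{-Act2}]
\]
\[
\frac{\checkmark_R(G_1) \quad \checkmark_R(G_2)}{\checkmark_R(G_1 ; G_2)}\;[\checkmark\text{-Seq}] \qquad
\frac{\checkmark_R(G_1) \quad \checkmark_R(G_2) \quad \mathrm{chan}(G_1) \cap \mathrm{chan}(G_2) = \emptyset}{\checkmark_R(G_1 \parallel G_2)}\;[\checkmark\text{-Par}]
\]

(a) Base calculus

\[
\frac{\checkmark_R(G_1) \quad \checkmark_R(G_2)}{\checkmark_R(\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2)}\;[\checkmark\text{-If}] \qquad
\frac{\checkmark_R(G)}{\checkmark_R(\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G)}\;[\checkmark\text{-While}]
\]

(b) Extension with if/while-statements (explained in Sect. 5)

\[
\frac{\checkmark_R(G_1) \quad \checkmark_R(G_2) \quad R_1, R_2 \subseteq R \quad R_1 \neq \emptyset \text{ implies } R_2 = \emptyset \quad R_2 \neq \emptyset \text{ implies } R_1 = \emptyset}{\checkmark_R(\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2})}\;[\checkmark\text{-NIf}]
\]
\[
\frac{\checkmark_R(G)}{\checkmark_R(\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset})}\;[\checkmark\text{-NWhile}]
\]

(c) Extension with non-blocking if/while-statements (explained in Sect. 6)


Fig. 6: Well-formedness of global programs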


We end this subsection with a well-formedness relation, denoted by ✓, to
check a few basic syntactic properties of global programs; it is induced by the
rules in Fig. 6a. For now, there are two aims (to be extended in subsequent
sections for if/while-statements):


1. Rules [✓-Act1] and [✓-Act2] ensure that R contains all roles that occur
in G. The idea is that when we project G onto every role in R (Sect. 7),
we get a local program for every remaining subject of G (i.e., when G is the
remaining global program, R may contain roles that participated in the past,
but no longer in the future). Thus, R spans the whole distributed system.


2. Rule [✓-Par] ensures that the channels (i.e., sender–receiver pairs) that occur
in G1 and G2 are disjoint; this is a common assumption in choreographic
programming (e.g., [8]). The idea is that when a communication happens in
G1 ∥ G2, it must be unambiguously clear whether it happened in G1 or in G2;
otherwise, the operational equivalence theorem cannot be proved (Sect. 7).
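
The two aims above can be turned into a small checker. The sketch below is illustrative only: it assumes the tuple encoding ("skip",), ("assign", q, y, e), ("comm", p, e, q, y), ("seq", G1, G2), ("par", G1, G2), which is a choice made here and not the paper's notation, and it assumes that the premise of the rule for communications requires both the sender and the receiver to be in R. The names `chan` and `well_formed` are hypothetical.

```python
def chan(g):
    """Channels (sender, receiver) that occur in a global program."""
    tag = g[0]
    if tag == "comm":
        _, p, _e, q, _y = g
        return {(p, q)}
    if tag in ("seq", "par"):
        return chan(g[1]) | chan(g[2])
    return set()  # skip and assignments use no channels

def well_formed(g, roles):
    """Check the base-calculus well-formedness conditions relative to roles."""
    tag = g[0]
    if tag == "skip":
        return len(roles) > 0
    if tag == "assign":
        return g[1] in roles
    if tag == "comm":
        _, p, _e, q, _y = g
        return p in roles and q in roles
    if tag == "seq":
        return well_formed(g[1], roles) and well_formed(g[2], roles)
    if tag == "par":
        return (well_formed(g[1], roles) and well_formed(g[2], roles)
                and not (chan(g[1]) & chan(g[2])))
    raise ValueError(f"unknown program form: {tag}")

ab1 = ("comm", "A", "x", "B", "y")
ab2 = ("comm", "A", "z", "B", "w")
ba = ("comm", "B", "y", "A", "x")
shared_channel = well_formed(("par", ab1, ab2), {"A", "B"})   # channel A->B used twice
disjoint_channels = well_formed(("par", ab1, ba), {"A", "B"})
```

Here, `("par", ab1, ab2)` is rejected because both operands use the channel from A to B, whereas `("par", ab1, ba)` is accepted because the channels differ in direction.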


4.2 Predicate Transformer 


In the next subsection, the main theorem for global programs will be as follows 
(informally): if the global program is well-formed, and if the precondition is 
true in the initial state, then deadlock freedom and functional correctness are 
provided. In this subsection, we present a technique to automatically compute 
preconditions such that the main theorem can indeed be formulated and proved. 
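\[
\begin{aligned}
\varphi(\mathsf{skip}, \chi) &= \chi \\
\varphi(q.y{:=}e, \chi) &= \chi[e/q.y] \\
\varphi(p.e{\to}q.y, \chi) &= \chi[e/q.y] \\
\varphi(G_1 ; G_2, \chi) &= \varphi(G_1, \varphi(G_2, \chi)) \\
\varphi(G_1 \parallel G_2, \chi) &=
  \begin{cases}
    \varphi(G_1, \varphi(G_2, \chi)) & \text{if } G_1 \mathbin{\#} G_2 \\
    \mathsf{false} & \text{otherwise}
  \end{cases}
\end{aligned}
\]

(a) Base calculus

\[
\begin{aligned}
\varphi(\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2, \chi) ={}
  & (\bigwedge\{e_r\}_{r\in R} \to \varphi(G_1, \chi)) \;\wedge \\
  & (\bigwedge\{\neg e_r\}_{r\in R} \to \varphi(G_2, \chi)) \;\wedge \\
  & (\bigwedge\{e_{r_1} \to e_{r_2}\}_{r_1, r_2 \in R}) \\
\varphi(\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G, \chi) ={}
  & \psi_{\mathrm{inv}} \wedge \Box(\psi_{\mathrm{inv}} \to ( \\
  & \quad (\bigwedge\{e_r\}_{r\in R} \to \varphi(G, \psi_{\mathrm{inv}})) \;\wedge \\
  & \quad (\bigwedge\{\neg e_r\}_{r\in R} \to \chi) \;\wedge \\
  & \quad (\bigwedge\{e_{r_1} \to e_{r_2}\}_{r_1, r_2 \in R})))
\end{aligned}
\]

(b) Extension with if/while-statements (explained in Sect. 5); □ reads “always” (i.e., in every state)

\[
\varphi(\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2}, \chi) =
  \begin{cases}
    \varphi(\mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2, \chi) & \text{if } R_1 = \emptyset = R_2 \\
    \varphi(G_2, \chi) \wedge \bigwedge\{\neg e_r\}_{r\in R \setminus R_2} & \text{if } R_1 = \emptyset \neq R_2 \\
    \varphi(G_1, \chi) \wedge \bigwedge\{e_r\}_{r\in R \setminus R_1} & \text{if } R_1 \neq \emptyset = R_2 \\
    \mathsf{false} & \text{if } R_1 \neq \emptyset \neq R_2
  \end{cases}
\]
\[
\varphi(\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset}, \chi) = \varphi(\mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G, \chi)
\]

(c) Extension with non-blocking if/while-statements (explained in Sect. 6)


Fig. 7: Predicate transformer to compute preconditions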


Let φ denote a predicate transformer function; it is defined by the equations
in Fig. 7a, where χ[e/q.y] denotes substitution of e for q.y in χ. In words, φ
consumes a global program G and a postcondition χ as input, and it produces a
precondition φ(G, χ) as output. The idea is that φ is sound: if φ(G, χ) is true in
the initial state, then after executing G, χ is true in the final state. Essentially,
Fig. 7a is an adaptation of Dijkstra’s predicate transformer to compute weakest
preconditions for sequential code [26], denoted by wp. More precisely:


— For q.y:=e, the definitions of φ and wp are the same; for p.e→q.y (absent
in Dijkstra’s work), φ works similarly. Figure 8a shows an example: if A.x is
5 (computed precondition), then after the communication of A.x+1 at Alice
into B.y at Bob (global program), the sum of A.x and B.y is 11 (postcon-
dition). We note that the postcondition relates variables at different roles;
this is straightforwardly supported by φ, without extra manual effort.


— For G1 ; G2, the definitions of φ and wp are the same as well: first, χ is used
as a postcondition of G2 to compute a precondition φ(G2, χ); next, φ(G2, χ)
is used as a postcondition of G1 to compute a precondition φ(G1, φ(G2, χ)).
Such a “backwards” computation of a precondition corresponds to the “for-
wards” execution of the sequence: initially, φ(G1, φ(G2, χ)) is true; subse-
quently, φ(G2, χ) is true after executing G1; finally, χ is true after executing
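\[
\begin{aligned}
\varphi(A.x{+}1{\to}B.y,\ A.x{+}B.y{==}11) &= A.x{+}A.x{+}1{==}11 \\
 &= A.x{+}A.x{==}10 \\
 &= A.x{==}5
\end{aligned}
\]

(a) Communication

\[
\begin{aligned}
\varphi(\gamma ; A.x{+}1{\to}B.y,\ A.x{+}B.y{==}11) &= \varphi(\gamma, \varphi(A.x{+}1{\to}B.y,\ A.x{+}B.y{==}11)) \\
 &= \varphi(\gamma,\ A.x{+}A.x{+}1{==}11) \\
 &= 5{+}5{+}1{==}11 = \mathsf{true}
\end{aligned}
\]

(b) Sequence

\[
\begin{aligned}
 & \varphi(\gamma ; \mathsf{if}\ (A.x{==}5 \wedge B.y{==}6)\ B.y{:=}7\ \mathsf{skip},\ \chi) \\
 ={} & \varphi(\gamma, \varphi(\mathsf{if}\ (A.x{==}5 \wedge B.y{==}6)\ B.y{:=}7\ \mathsf{skip},\ \chi)) \\
 ={} & \varphi(\gamma,\ (A.x{==}5 \wedge B.y{==}6 \to \varphi_1) \wedge (\neg A.x{==}5 \wedge \neg B.y{==}6 \to \varphi_2) \wedge (A.x{==}5 \leftrightarrow B.y{==}6)) \\
 ={} & (5{==}5 \wedge B.y{==}6 \to \varphi_1[5/A.x]) \wedge (\neg 5{==}5 \wedge \neg B.y{==}6 \to \varphi_2[5/A.x]) \wedge (5{==}5 \leftrightarrow B.y{==}6) \\
 ={} & (B.y{==}6 \to \varphi_1[5/A.x]) \wedge (\mathsf{false} \to \varphi_2[5/A.x]) \wedge B.y{==}6 \\
 ={} & \varphi_1[5/A.x] \wedge B.y{==}6
\end{aligned}
\]

(c) Conditional choice (explained in Sect. 5). Let φ1 = φ(B.y:=7, χ), φ2 = φ(skip, χ).


Fig. 8: Examples of φ. Let γ = A.x:=5.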


G2. Figure 8b shows an example: if true is true (i.e., unconditionally), after
executing the global program, the sum of A.x and B.y is 11.

However, unlike Dijkstra’s setting (i.e., strong sequencing), there is a caveat
in our setting (i.e., weak): G1 and G2 may be executed out-of-order. This
makes proving the soundness of φ more challenging than in Dijkstra’s work
(notably: establishing the correspondence between backwards computation
of a precondition and forwards execution of the sequence).


— For G1 ∥ G2 (absent in Dijkstra’s work), the definition of φ is inspired by
the notion of disjoint parallelism in Hoare logic [33,1]. There are two cases.
If G1 and G2 are non-interfering, which means that the variables that occur
in G1 and G2 are disjoint, denoted as G1 # G2, then the order in which G1
and G2 are executed does not affect the truth/falsehood of the postcondition;
in that case, a precondition is computed by assuming, arbitrarily, in-order
execution of G1 and G2 (but any other interleaving would work as well).
If G1 and G2 are interfering, then φ yields false, so no state satisfies the pre-
condition. This is sound but not complete (i.e., there exist deadlock-free and
functionally-correct global programs for which the computed precondition
is nevertheless false). For our present purposes, however, φ is “complete
enough” (e.g., all examples in Sect. 2 and Appx. A [39] are supported).⁴


The following proposition follows almost directly from the definitions. It states
that if φ(γ, χ) is true in S, then χ is true in S′, after executing γ.


Proposition 1. If S ∈ ⟦φ(γ, χ)⟧ and S′ = effect(evals(γ), S), then S′ ∈ ⟦χ⟧.
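
To make the backwards computation tangible, here is a semantic sketch of φ for skip, assignments, communications, and weak sequences. It is not the syntactic transformer of Fig. 7a: instead of substituting expressions in formulas, it represents a predicate as a Python function from states to booleans and an expression as a function of a local store (both are assumptions of this sketch). Interleaving is left out because it needs the variables of a program, which opaque functions do not expose. The names `substitute` and `phi` are hypothetical.

```python
def substitute(state, role, var, value):
    """S[v/q.y]: a copy of state in which var at role is bound to value."""
    new_state = {r: dict(store) for r, store in state.items()}
    new_state[role][var] = value
    return new_state

def phi(g, chi):
    """Precondition (as a predicate on states) of program g for postcondition chi."""
    tag = g[0]
    if tag == "skip":
        return chi
    if tag == "assign":                      # ("assign", q, y, e)
        _, q, y, e = g
        return lambda s: chi(substitute(s, q, y, e(s[q])))
    if tag == "comm":                        # ("comm", p, e, q, y)
        _, p, e, q, y = g
        return lambda s: chi(substitute(s, q, y, e(s[p])))
    if tag == "seq":                         # ("seq", G1, G2)
        return phi(g[1], phi(g[2], chi))
    raise ValueError(f"unsupported program form: {tag}")

# Postcondition A.x + B.y == 11, as in Fig. 8.
chi = lambda s: s["A"]["x"] + s["B"]["y"] == 11
comm = ("comm", "A", lambda store: store["x"] + 1, "B", "y")
assign = ("assign", "A", "x", lambda store: 5)

pre_a = phi(comm, chi)                       # Fig. 8a: holds iff A.x == 5
pre_b = phi(("seq", assign, comm), chi)      # Fig. 8b: holds in every state
```

Under these assumptions, `pre_a` holds exactly in the states where A.x is 5, and `pre_b` holds regardless of the initial values, which agrees with the computed preconditions A.x==5 and true in Fig. 8a and Fig. 8b.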


4 Even though φ requires non-interference, interleaving (∥) offers additional expressive
power beyond weak sequencing (;). This is because non-interference (for ∥) is defined
in terms of disjointness of variables, whereas independence (for ;) is defined in terms
of disjointness of roles. For instance, A.x:=5 and A.y:=6 are non-interfering, but
not independent. Consequently, A.x:=5 ∥ A.y:=6 allows the assignments to happen
in any order, whereas A.x:=5; A.y:=6 requires them to happen from left to right.
4.3 Deadlock Freedom and Functional Correctness


The aim of this subsection is to formulate and prove the main theorem for global 
programs, which covers both deadlock freedom and functional correctness. 

To give a uniform presentation across Sects. 4–6, we formulate the lemmas
and theorem for the base calculus in this section in a way that they are reusable— 
verbatim—for the extensions in the next sections. As a result, some formulations 
are more restrictive than necessary for the base calculus, but this is fine. 

The first two lemmas pertain to φ’s soundness. The first lemma states that if
G is well-formed and can terminate, then the truth of φ(G, χ) implies the truth
of χ (i.e., the postcondition has been brought about). The second lemma states
that if G is well-formed and can reduce to G′ when ψ is true by executing γ,
then the truth of φ(G, χ) ∧ ψ implies the truth of χ, after executing γ ; G′ (i.e.,
the postcondition is being brought about by executing γ).


Lemma 1. If ✓_R(G) and G↓, then ⟦φ(G, χ)⟧ ⊆ ⟦χ⟧.


Proof. By induction on the derivation of G↓.


Lemma 2. If ✓_R(G) and G –ψ,γ→ G′, then ⟦φ(G, χ) ∧ ψ⟧ ⊆ ⟦φ(γ ; G′, χ)⟧.


Proof. By induction on the derivation of G –ψ,γ→ G′. The interesting case is rule
[→-Seq2], with G = G1 ; G2. We need to prove the following inclusions:

\[ [\![\varphi(G_1, \varphi(G_2, \chi)) \wedge \psi]\!] \;\subseteq\; [\![\varphi(G_1 ; \gamma ; G_2', \chi)]\!] \;\subseteq\; [\![\varphi(\gamma ; G_1 ; G_2', \chi)]\!] \]

The first inclusion can be proved using the induction hypothesis and G2 –ψ,γ→ G2′
(right premise of rule [→-Seq2]). The second inclusion can be proved using
subj(G1) ∩ subj(γ) = ∅ (left premise) and ✓_R(G), to establish that the variables
that occur in G1 and γ are disjoint as well (i.e., G1 and γ are non-interfering).


The next lemma states that well-formedness is preserved by reduction.


Lemma 3. If ✓_R(G) and ⟦φ(G, χ)⟧ ≠ ∅ and G –ψ,γ→ G′, then ✓_R(G′).


Proof. By induction on the derivation of G –ψ,γ→ G′.


The next lemma states that if G is well-formed, and if φ(G, χ) is true in S,
then either G can terminate, or G can reduce to G′ (i.e., G is not stuck).


Lemma 4. If ✓_R(G) and S ∈ ⟦φ(G, χ)⟧, then either G↓, or there exist ψ, γ, G′
such that G –ψ,γ→ G′ and S ∈ ⟦ψ⟧.


Proof. By induction on the derivation of ✓_R(G).


Now, our main theorem for global programs states that if G is well-formed,
and if φ(G, χ) is true in S, and if (G, S) has a sequence of reductions to (G†, S†),
then either (G†, S†) can terminate and χ is true in S†, or (G†, S†) can reduce.
Thus, an execution of (G, S) consists of either finitely many reductions, followed
by termination, or infinitely many (i.e., deadlock freedom); in the former case,
upon termination, the postcondition is true (i.e., functional correctness).

Theorem 3. If ✓_R(G) and S ∈ ⟦φ(G, χ)⟧ and (G, S) –γ1ᶜ→ ⋯ –γnᶜ→ (G†, S†), then:


1. Either (G†, S†)↓, or there exist γᶜ, G‡, S‡ such that (G†, S†) –γᶜ→ (G‡, S‡).
2. If (G†, S†)↓, then S† ∈ ⟦χ⟧.


Proof. First, we inductively apply Prop. 1 and Lems. 2-3, along the reduction
sequence to prove ✓_R(G†) and S† ∈ ⟦φ(G†, χ)⟧. Next, we apply Lem. 4 to prove
deadlock freedom and Lem. 1 to prove functional correctness, using Fig. 5.


5 Global Programs: If/While-Statements 


In the previous section, to gently introduce the main components of our theory, 
we presented a base calculus of global programs. In this section, we extend it 
with if/while-statements to support decentralised decision making.


5.1 Syntax and Semantics 


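Recall that Γ and 𝔾 denote universes of global actions and global programs,
ranged over by γ and G; they are induced by the following extended grammar:

\[
\begin{aligned}
\gamma &::= \cdots \text{ (page 8)} \;\mid\; i^R \\
G &::= \cdots \text{ (page 8)} \;\mid\; \mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2 \;\mid\; \mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G
\end{aligned}
\]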


Informally, the new grammar elements have the following meaning: 


— Global action i^R, with i ∈ {1,2}, models a collection of private decisions
at all roles in R together (i.e., at the same time). In case of an if-statement,
i=1 and i=2 indicate entering the then-branch and else-branch; in case of a
while-statement, i=1 and i=2 indicate (re)entering the loop and exiting it.


— Global program if ⋀{e_r}_{r∈R} G1 G2 prescribes a conditional choice of G1
and G2. The idea is that every role r ∈ R privately evaluates its own conjunct
e_r of multiparty condition ⋀{e_r}_{r∈R} and, based on the outcome, privately
decides to enter G1 or G2. As a result, we have three cases to consider:


• Case A: If e_r is true for every r ∈ R, then everyone enters G1.

• Case B: If e_r is false for every r ∈ R, then everyone enters G2.

• Case C: If e_{r1} is true, but e_{r2} is false, for some r1, r2 ∈ R, then someone
enters G1, but someone else enters G2.


Cases A and B are the “good” situations in which the roles are unanimous. 
In contrast, case C is the “bad” situation that leads to deadlock. 

For simplicity, in this section, we assume that roles make private decisions
together (i.e., at the same time), using two synchronisation barriers. Intu-
itively, in operational terms, this means that for every role r: first, it privately
evaluates its own conjunct e_r; next, it reaches one of two barriers, depending
on the truth/falsehood of e_r; next, it waits until every other role has pri-
vately evaluated a conjunct and reached a barrier as well. In cases A and B,
all roles eventually reach the same barrier, so it breaks, and all roles enter
one branch together; in case C, the roles never reach the same barrier (they
are divided), so neither barrier breaks, and the roles get stuck.

(We note that barriers are often undesirable in distributed systems. In the 
next section, therefore, we also extend the base calculus with barrier-free 
if/while-statements. However, as the technical details of the barrier-free ver- 
sions are considerably more complicated than the barrier-based versions, but 
partly rely on similar principles, we present the barrier-based ones first.) 
An if-statement cannot terminate: a decision must be made. 


— Global program while ⋀{e_r}_{r∈R} {ψ_inv} G prescribes a conditional loop
of G. The idea is similar to if ⋀{e_r}_{r∈R} G1 G2, including non-termination.
Condition ψ_inv is the loop invariant; it does not affect the operational se-
mantics of while-statements, but it is used to compute preconditions.


Formally, for if/while-statements, → is induced by the rules in Fig. 4b (page 10).
The presence of rules [→-If1] and [→-If2] corresponds to cases A and B, whereas
the absence of other rules corresponds to case C (i.e., there are no reductions
when roles are not unanimous). For instance, when G = A.x:=5 ; if (A.x==5 ∧
B.y==6) B.y:=7 skip, the following two abstract executions are possible:

\[
G \xrightarrow{\mathsf{true},\;A.x:=5} \bullet \xrightarrow{A.x==5 \wedge B.y==6,\;1^{\{A,B\}}} \bullet \xrightarrow{\mathsf{true},\;B.y:=7} \bullet \downarrow
\qquad\qquad
G \xrightarrow{\mathsf{true},\;A.x:=5} \bullet \xrightarrow{\neg A.x==5 \wedge \neg B.y==6,\;2^{\{A,B\}}} \bullet \downarrow
\]


First, G reduces by executing an assignment at Alice (both executions); next, 
it reduces by executing private decisions at Alice and Bob together to enter the 
then-branch (left execution) or else-branch (right); next, in the former case, it 
reduces by executing an assignment at Bob and terminates, whereas in the latter 
case, it terminates. Regarding concrete executions, two situations are possible: 


— If B.y is initially 6, then the left abstract execution can induce a deadlock-
free concrete one: after the first concrete reduction, A.x is 5, and B.y is still
6, so A.x==5 ∧ B.y==6 is true (i.e., case A, unanimity), enabling the sequel.

— If B.y is initially not 6, then neither abstract execution can induce a
deadlock-free concrete one: after the first concrete reduction, A.x is 5, but
B.y is still not 6, so both A.x==5 ∧ B.y==6 and ¬A.x==5 ∧ ¬B.y==6 are false
(i.e., case C, non-unanimity), disabling the sequel and causing a deadlock.


This example shows that we need a technique to infer that B.y must initially be 
6 to ensure unanimity for deadlock freedom; we present it in the next subsection. 
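
The case distinction behind this example can be sketched as follows. The function below is a hypothetical helper, not part of the calculus: it assumes that a state is a nested dictionary from roles to local stores and that each conjunct is a Python function of the local store of its role, and it merely classifies a state as case A, B, or C.

```python
def classify(conjuncts, state):
    """Return "A", "B", or "C" for a multiparty condition in a given state.

    conjuncts maps every role r to its private conjunct e_r, represented
    as a predicate on the local store of r (an assumption of this sketch).
    """
    votes = [bool(e(state[r])) for r, e in conjuncts.items()]
    if all(votes):
        return "A"      # unanimous: everyone enters the then-branch
    if not any(votes):
        return "B"      # unanimous: everyone enters the else-branch
    return "C"          # divided: with barriers, the roles get stuck

# Multiparty condition A.x==5 /\ B.y==6 from the running example.
condition = {
    "A": lambda store: store["x"] == 5,
    "B": lambda store: store["y"] == 6,
}
good = classify(condition, {"A": {"x": 5}, "B": {"y": 6}})
bad = classify(condition, {"A": {"x": 5}, "B": {"y": 0}})
```

With A.x equal to 5 after the assignment, the outcome depends only on B.y: the value 6 gives case A, and any other value gives case C, which is the deadlock described above.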

We end this subsection with an extension of ✓ for if/while-statements; it is
induced by the rules in Fig. 6b (page 12). There is a third aim now (cf. page 12):


3. Rules [✓-If] and [✓-While] ensure that every role (in R) has its own con-
junct in every multiparty condition. The idea is that every role always needs
to know which branch to enter, so it must participate in every decision.⁵˒⁶


5 Well-formedness (every role has its own conjunct) and the grammar of if/while-state- 
ments (every conjunct is local to a role) are jointly similar to the variable-knowledge- 
condition of Neykova et al. [43]; they ensure that formulas are projectable (Fig. 10b). 

6 It is possible to encode choices in which only a few—not all—roles participate using 
extra variables; the idea is outlined at the end of Appx. A [39]. However, this encoding 
5.2 Predicate Transformer


We proceed with an extension of φ for if/while-statements; it is defined by
the equations in Fig. 7b (page 13). As before, the definition of φ for if/while-
statements is an adaptation of the definition of wp (i.e., Dijkstra’s original pred-
icate transformer [26]), but it differs on crucial points as well. More precisely:


— For if ⋀{e_r}_{r∈R} G1 G2, the definition of φ has three conjuncts. The first
(resp. second) conjunct states that if every e_r is true (resp. false), then the
precondition of the then-branch (resp. else-branch) is true. This is similar to
the definition of wp, and it includes case A (resp. B) on page 16.
The third conjunct states that every e_{r1} must imply every e_{r2} (i.e., they are
either all true or all false); this is new relative to the definition of wp, and it
excludes case C on page 16 (i.e., if the precondition computed by φ is true,
then case C will never arise). The following proposition makes this precise.


Proposition 2. ⟦⋀{e_{r1} → e_{r2}}_{r1,r2∈R}⟧ ⊆ ⟦⋀{e_r}_{r∈R} ∨ ⋀{¬e_r}_{r∈R}⟧.


Thus, φ accumulates logical requirements not only to ensure the truth of the
postcondition for functional correctness (i.e., the first and second conjunct),
but also to ensure unanimity for deadlock freedom (i.e., the third conjunct).
Figure 8c (page 14) shows an example, featuring the same global program
as G on page 17: if φ1[5/A.x] is true (to ensure the truth of χ) and B.y is
6 (to ensure unanimity), then after executing the global program, χ is true.
Thus, φ mechanises our reasoning about G on page 17.


— For while ⋀{e_r}_{r∈R} {ψ_inv} G, the definition of φ has an “outer conjunction”
and an “inner conjunction”. The inner conjunction is similar to φ for if-state-
ments: either every e_r and the precondition of the body are true, to (re-)enter
the loop, or every ¬e_r and the postcondition are true, to exit it.

The second outer conjunct states that always (i.e., in every state, i.e., before
and after executing the body), if the invariant is true, then the inner con-
junction is true; the first outer conjunct states that the invariant is indeed
true (i.e., before executing the body). This is similar to the definition of wp.
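
Proposition 2 can be sanity-checked by brute force for small sets of roles. The sketch below is only an illustration (it is not a proof and not part of the paper): it treats every conjunct as a plain truth value and enumerates all assignments; the function names are hypothetical.

```python
from itertools import product

def pairwise_implies(values):
    """Third conjunct: every e_r1 implies every e_r2."""
    return all((not a) or b for a in values for b in values)

def unanimous(values):
    """All conjuncts are true (case A) or all are false (case B)."""
    return all(values) or not any(values)

def check_prop2(num_roles):
    """Check that pairwise implication entails unanimity for num_roles roles."""
    return all(
        unanimous(values)
        for values in product([False, True], repeat=num_roles)
        if pairwise_implies(values)
    )

prop2_holds = all(check_prop2(n) for n in range(1, 7))
```

The enumeration also shows why the third conjunct excludes case C: any assignment with one true and one false conjunct violates the implication from the true conjunct to the false one.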


5.3 Deadlock Freedom and Functional Correctness 


To extend the main theorem for global programs (Thm. 3, page 16) to cover 
if/while-statements, we need to extend the auxiliary lemmas (Lem. 1-4, page 15 
onwards); the proof of the theorem relies on the lemmas and is the same. 


Lemma 5. Lemmas 1-4 hold for this section’s extension. 


Proof. For Lem. 1 there are no new cases (i.e., no new termination rules in
Fig. 4b). For Lems. 2-3, the new cases (i.e., new reduction rules in Fig. 4b) can
be proved directly. For Lem. 4, the new cases (i.e., new well-formedness rules
in Fig. 6b) can be proved using Prop. 2, to establish that rule [→-If1] or rule
[→-If2] is applicable in such a way that S ∈ ⟦ψ⟧ holds as well.


is not always practical/user-friendly. We therefore aim to offer “native” support for 
such choices too, using a form of merging [8,9,10]; see also Appx. D [39]. 
Theorem 4. Theorem 3 holds for this section’s extension.


Proof. The same as the proof of Thm. 3, using Lem. 5 instead of Lems. 1–4.


6 Global Programs: Non-Blocking If/While-Statements 


In the previous section, we extended the base calculus of global programs with
blocking if/while-statements; they require roles to make private decisions together
(i.e., at the same time), using barriers. In this section, we extend the base calcu-
lus also with non-blocking if/while-statements; they allow roles to make private
decisions alone (i.e., at their own pace). This is often preferable.


6.1 Syntax and Semantics 


Recall that 𝔾 denotes a universe of global programs, ranged over by G; it is
induced by the following extended grammar:

\[ G ::= \cdots \text{ (page 16)} \;\mid\; \mathsf{if}\ \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \;\mid\; \mathsf{while}\ \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset} \]

Informally, the new grammar elements have the following meaning:⁷


— Global program if ⋀{e_r}_{r∈R} G1|R1 G2|R2 prescribes a non-blocking con-
ditional choice of G1 and G2. It relies on similar principles as the blocking
version; notably, the same cases A, B, C on page 16 are applicable.

The key difference with the blocking version is that roles make private deci-
sions alone (i.e., at their own pace), without using synchronisation barriers.
Intuitively, in operational terms, this means that for every role r: first, it pri-
vately evaluates its own conjunct e_r; next, it immediately enters a branch.
To accommodate this, extra syntactic bookkeeping, in the form of the “|R1”
and “|R2” notation, is needed to keep track of roles’ decisions.

More precisely, at any time, R_i contains every role that has already made a
private decision to enter G_i. Initially, both R1 and R2 are empty. In case A
(resp. B), R1 (resp. R2) eventually becomes “full” and contains all roles, while
R2 (resp. R1) always remains empty. In case C, both R1 and R2 eventually
become non-empty, but they always remain “non-full” as well.

A non-blocking if-statement can terminate when all roles have made a private 
decision and every entered branch can terminate. 


— Global program while ⋀{e_r}_{r∈R} {ψ_inv} G|∅ prescribes a non-blocking
conditional loop of G. The idea is similar to if ⋀{e_r}_{r∈R} G1|R1 G2|R2,
except that no extra bookkeeping is needed (i.e., ∅ is fixed in “|∅”): non-
blocking while-statements will be unfolded into non-blocking if-statements.
(The reason for the seemingly redundant “|∅” notation is to give non-blocking
while-statements a different grammatical form than blocking ones.)


7 Blocking and non-blocking if/while-statements have different syntax. This makes it
possible to mix the blocking and non-blocking versions in the same global program 
(we have not encountered a compelling use case for this yet, though). 
Formally, for non-blocking if/while-statements, ↓ and → are induced by the rules
in Fig. 4c (page 10). Rule [↓-NIf] states that if every role has made a private
decision (left premise), and if G1 and G2 can terminate when at least one role
has entered it (middle and right premise), then the non-blocking if-statement can
terminate. The effect of the “R_i ≠ ∅” conditions is that a non-entered branch
does not need to be able to terminate for the whole if-statement to be able to
terminate. We note that rule [↓-NIf] also covers the case in which both R1 and
R2 are non-empty, which should never have happened in the first place; shortly,
we will rule it out using well-formedness and the predicate transformer.

Rules [→-NIf1] and [→-NIf2] state that if r has not made a private decision
yet (left premise), then the non-blocking if-statement can reduce by executing
one. For instance, when G = A.x:=5 ; if (A.x==5 ∧ B.y==6) B.y:=7|∅ skip|∅ and
ψ = A.x==5 ∧ B.y==6, the following two abstract executions are possible:

\[
\begin{aligned}
G &\xrightarrow{\mathsf{true},\;A.x:=5} \bullet \xrightarrow{A.x==5,\;1^{\{A\}}} \bullet \xrightarrow{B.y==6,\;1^{\{B\}}} \bullet \xrightarrow{\mathsf{true},\;B.y:=7} \mathsf{skip} \,;\, \mathsf{if}\ \psi\ \mathsf{skip}|_{\{A,B\}}\ \mathsf{skip}|_{\emptyset} \;\downarrow \\
G &\xrightarrow{\mathsf{true},\;A.x:=5} \bullet \xrightarrow{A.x==5,\;1^{\{A\}}} \bullet \xrightarrow{\neg B.y==6,\;2^{\{B\}}} \mathsf{skip} \,;\, \mathsf{if}\ \psi\ B.y{:=}7|_{\{A\}}\ \mathsf{skip}|_{\{B\}}
\end{aligned}
\]


First, G reduces twice by executing an assignment and a private decision at Alice 
alone to enter the then-branch (both executions); next, it reduces by executing 
a private decision at Bob alone to enter the then-branch (top execution) or 
else-branch (bottom); next, in the latter case, it is stuck. Regarding concrete 
executions, if B.y is initially not 6, then a deadlock-free one does not exist: the 
top abstract execution cannot be enriched (i.e., after the second reduction, the 
sequel is disabled); the bottom abstract execution can be enriched but gets stuck. 
We note that unlike rules [→-If1] and [→-If2], there is no direct correspondence
between rules [→-NIf1] and [→-NIf2] and cases A, B, C on page 16.

Rules [→-NIf3] and [→-NIf4] state that if G1 or G2 can reduce by executing
γ (left premise), and if the subjects of γ have previously entered G1 or G2 (right
premise), then the non-blocking if-statement can reduce accordingly. This means
that global actions in the branches can be executed already before all private
decisions have been made, out-of-order. We note that the set differences in the
premises of these rules are needed, because in general (but undesirably), R1 and
R2 may overlap; shortly, we will rule out this possibility using well-formedness
and the predicate transformer. For instance, with the same G as above, also the
following abstract execution is possible (due to rule [→-Seq2] as well):

\[
G \xrightarrow{\mathsf{true},\;A.x:=5} \bullet \xrightarrow{B.y==6,\;1^{\{B\}}} \bullet \xrightarrow{\mathsf{true},\;B.y:=7} \bullet \xrightarrow{A.x==5,\;1^{\{A\}}} \mathsf{skip} \,;\, \mathsf{if}\ \psi\ \mathsf{skip}|_{\{A,B\}}\ \mathsf{skip}|_{\emptyset} \;\downarrow
\]

Rule [→-NWhile] unfolds the non-blocking while-statement.

We end this subsection with an extension of ✓ for non-blocking if/while-
statements; it is induced by the rules in Fig. 6c (page 12). There is a fourth aim
now (cf. page 12 and page 17):


4. Rule [✓-NIf] ensures that case A or B on page 16 applies, but not case C:
when roles make private decisions alone, they must still be unanimous.
6.2 Predicate Transformer


For non-blocking if/while-statements, φ is defined by the equations in Fig. 7c
(page 13). It is based on the extension for the blocking variants in Fig. 7b:


— For if ⋀{e_r}_{r∈R} G1|R1 G2|R2, the definition of φ has four cases.
If R1 and R2 are both empty, then no role has made a private decision
to enter a branch yet, so the precondition is the same as for blocking if-
statements (i.e., either choice is still possible). This shows that blocking
and non-blocking if-statements are functionally equivalent in the following
sense: to ensure that the same postcondition is true in the end, the same
precondition must be true in the beginning.
If R_i and R_j are empty and non-empty, respectively, then the roles in R_j have
privately decided to enter G_j. Thus, the precondition of G_j must be true. Moreover,
to ensure that the remaining roles in R \ R_j will privately make the same
decision to enter G_j, their conjuncts must be all true (if j=1) or all false (if
j=2) as well. In this way, cases A and B on page 16 are included.
If R1 and R2 are both non-empty, then roles have privately decided to enter
both G1 and G2, which should never have happened. Thus, the precondition
is false. In this way, case C on page 16 is excluded.


— For while A{er}rer {Winv} Glo, no role has made a private decision to 
(re)enter the loop or exit it yet, so the precondition is the same as for block- 
ing while-statements. When the first role privately decides, the non-blocking 
while-statement is unfolded into a non-blocking if-statement. 
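
The case analysis for non-blocking if-statements can be sketched in Python. This is only an illustration under stated assumptions, not the definition of Fig. 7c: predicates are modelled as Python functions on states, the function name nif_precondition and its parameters are hypothetical, and the shape of the blocking precondition (all conjuncts true and the precondition of G1, or all conjuncts false and the precondition of G2) is an assumption inferred from cases A and B. 

```python
def nif_precondition(conds, r1, r2, pre1, pre2):
    """Sketch of the four cases for a non-blocking if-statement.

    conds: dict mapping each role to its conjunct (a predicate on states)
    r1, r2: sets of roles that already entered branch 1 / branch 2
    pre1, pre2: preconditions of the two branches (predicates on states)
    All names here are hypothetical; they do not come from the paper.
    """
    roles = set(conds)

    def all_hold(rs, s):
        return all(conds[r](s) for r in rs)

    def none_hold(rs, s):
        return all(not conds[r](s) for r in rs)

    if not r1 and not r2:
        # Nobody decided yet: same as the blocking variant (ASSUMED shape).
        return lambda s: (all_hold(roles, s) and pre1(s)) or \
                         (none_hold(roles, s) and pre2(s))
    if r1 and r2:
        # Roles entered both branches: case C, excluded.
        return lambda s: False
    if r1:
        # Branch 1 was entered: remaining conjuncts must all be true.
        return lambda s: pre1(s) and all_hold(roles - r1, s)
    # Branch 2 was entered: remaining conjuncts must all be false.
    return lambda s: pre2(s) and none_hold(roles - r2, s)


conds = {"A": lambda s: s["A.x"] == 5, "B": lambda s: s["B.y"] == 6}
always = lambda s: True
agree = {"A.x": 5, "B.y": 6}
split = {"A.x": 5, "B.y": 7}
```

With these hypothetical conjuncts, the state `split` (where A's conjunct is true and B's is false) is rejected in every case, which mirrors the exclusion of case C. 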


6.3 Main Theorem: Deadlock Freedom and Functional Correctness 


To extend the main theorem for global programs (Thm. 3, page 16) to cover non- 
blocking if/while-statements, we need to extend the auxiliary lemmas (Lem. 1-4, 
page 15 onwards); the proof of the theorem relies on the lemmas and is the same. 


Lemma 6. Lemmas 1-4 hold for this section’s extension. 


Proof. For Lem. 1, the new case (i.e., rule [↓-NIF] in Fig. 4c) can be proved 
using ✓R(G), to rule out the degenerate case that a non-blocking if-statement 
with the “empty” multiparty condition ⋀{e_r}r∈∅ can terminate. For Lem. 2, the 
new cases (i.e., new reduction rules in Fig. 4c) can be proved directly. For Lem. 3, 
the new cases (i.e., new reduction rules in Fig. 4c) can be proved using ✓R(G) 
and the non-emptiness of the denotation of the computed precondition (first and 
second premise of Lem. 3), to establish that R1 or R2 is empty before the 
reduction, and remains empty after it (i.e., case C on page 16 never arises). 
For Lem. 4, the new cases (i.e., new well-formedness rules in Fig. 6b) can be 
proved using Prop. 2, to establish that rule [→-NIF1] or rule [→-NIF2] is 
applicable in such a way that S ∈ [y] holds as well. 


Theorem 5. Theorem 3 holds for this section’s extension. 


Proof. The same as the proof of Thm. 3, using Lem. 6 instead of Lems. 1-4. 


7 Local Programs and Projection 


In the previous sections, to cover the upper half of Fig. 1, we incrementally 
presented a calculus of global programs with blocking and non-blocking if/while- 
statements. In this section, to cover the bottom half, we present a complementary 
calculus of local programs and a projection function. 


7.1 Syntax and Semantics 


Let A and L denote universes of local actions and local programs, ranged over 
by λ and L; they are induced by the following grammar: 


λ ::= q.y:=e | pq!e | pq?y | i_r^R | τ 
L ::= skip | λ | L1;L2 | L1 ∥ L2 | 
      R.if e L1 L2 | R.while e L | if e|n L1|R1 L2|R2 | while e|n L|∅ 


Informally, these grammar elements have the following meaning: 


— Local action q.y:=e models an assignment, as before. 

— Local actions pq!e and pq?y model a send and a receive of the value of 
expression e at role p into variable y at role q. 

— Local action i_r^R, with i ∈ {1,2}, models a private decision at role r, as 
part of a collection of private decisions at all roles in R together. 


— Local action τ models a delay (i.e., passage of time in which a role sits idle). 


— The local programs have largely the same meaning as their global counter- 
parts. There are two differences. First, the extra “R.” notation in blocking 
if/while-statements allows a role to know which other roles to wait for be- 
fore entering a branch. Second, the extra “|n” notation in non-blocking if/ 
while-statements allows a role to delay n times (motivated below). 


Formally, the abstract termination and reduction relations for local programs 
are induced by the same rules as in Fig. 4 (page 10), mutatis mutandis, except: 


— In the rules for if/while-statements: every “⋀{e_r}r∈R” and “⋀{¬e_r}r∈R” is 
replaced with “e” and “¬e”, while every “i^R” and “i^{r}” is replaced with “i_r^R” 
and “i_r^{r}” such that e is local to r. See Appx. B [39] for details. 

— There is an extra rule for non-blocking if-statements to execute a delay and 
decrement n if n>0 (motivated below, when discussing projection). 


Let R ⇀ L denote a universe of families of local programs (i.e., partial func- 
tions from roles to local programs), ranged over by ℒ. Informally, ℒ prescribes a 
parallel composition of the k local programs in its image ℒ(r1),...,ℒ(rk). 
Formally, the abstract termination and reduction relations are induced by the 
rules in Fig. 9. They state that an assignment and a delay are executed alone, 
while a send-receive pair and a collection of private decisions are executed to- 
gether. We note that for n=1, the bottom-left rule to execute i^{r1,...,rn} covers 
the case of non-blocking if/while-statements. Furthermore, the mechanisms by 
which “togetherness” arises (i.e., channels and barriers) are left implicit; they 
are implementation details. The concrete termination and reduction relations 
are induced by the same rules as in Fig. 5 (page 11), mutatis mutandis. 

[Fig. 9: inference rules not recoverable from the extracted text.] 
Fig. 9: Abstract operational semantics of families of local programs. 
ℒ[r ↦ L′] denotes the update of the image of r in ℒ to L′. 

  q.y:=e ↾ r   =  q.y:=e   if r = q 
                  τ        otherwise 

  p.e→q.y ↾ r  =  pq!e     if r = p 
                  pq?y     if r = q 
                  τ        otherwise 

  i^R ↾ r      =  i_r^R    if r ∈ R 
                  τ        otherwise 

(a) Global actions 

  skip ↾ r  =  skip 
  ⋀{e_r′}r′∈R ↾ r  =  e_r   if r ∈ R 
                      true  otherwise 
  G1 ∘ G2 ↾ r  =  (G1 ↾ r) ∘ (G2 ↾ r) 
  if Y G1 G2 ↾ r  =  R.if (Y ↾ r) (G1 ↾ r) (G2 ↾ r) 
  while Y {Winv} G ↾ r  =  R.while (Y ↾ r) (G ↾ r) 
  if Y G1|R1 G2|R2 ↾ r  =  if (Y ↾ r)|n (G1 ↾ r)|R1∩{r} (G2 ↾ r)|R2∩{r},  with n = |R \ (R1 ∪ R2 ∪ {r})| 
  while Y {Winv} G|∅ ↾ r  =  while (Y ↾ r)|n (G ↾ r)|∅,  with n = |R \ {r}| 

(b) Global programs. Let ∘ ∈ {;, ∥} and r ∈ R. 

Fig. 10: Decomposition of global actions/programs into local actions/programs 

To decompose global actions and programs into local ones, let ↾ denote a pro- 
jection function; it is induced by the equations in Fig. 10. In words, ↾ consumes 
a global program G and a role r as input, and it produces a local program G ↾ r 
as output. The idea is that ↾ is sound and complete: roughly, G can terminate 
or reduce by executing γ if, and only if, G ↾ r can similarly terminate or reduce 
by executing γ ↾ r. The interesting cases of Fig. 10 are as follows: 


— For γ (any global action), there are basically two possibilities. If r is a subject 
of γ, then γ ↾ r is the contribution of r to γ (i.e., an assignment remains 
an assignment; a communication is split into a separate send and receive; a 
collection of private decisions is split into separate ones). If r is not a subject 
of γ, then γ ↾ r is a delay (i.e., r sits idle, without contributing to γ). 


— For G = if Y G1|R1 G2|R2, the definition of ↾ is most complicated. We explain 
it from the perspective of soundness. There are three situations to consider. 
First, suppose that G reduces by executing a global action γ in which r does 
participate. To ensure that G ↾ r can similarly reduce by executing γ ↾ r, it 
will be sufficient to register in G ↾ r whether or not r has already entered a 
branch in G (and which one). This is achieved by “|R1∩{r}” and “|R2∩{r}”. 

Second, suppose that G reduces by executing a global action i^{r′} in which r 
does not participate, using rule [→-NIF1] or rule [→-NIF2], so another role 
r′ enters G1 or G2. To ensure that G ↾ r can similarly reduce by executing τ, 
it will be sufficient to register in G ↾ r the number of roles that have not yet 
entered a branch in G, excluding r. This is achieved by “| |R\(R1∪R2∪{r})|”. 

Third, suppose that G reduces by executing a global action γ in which r 
does not participate using rule [→-NIF3] or rule [→-NIF4]. To ensure that 
G ↾ r can similarly reduce, no additional information needs to be registered. 
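
The projection cases discussed above can be sketched in Python on a small tuple-based syntax tree. This is an illustrative sketch only: the tuple encoding, the tags ("assign", "comm", "decide", "nif", ...) and the function names project_action and project are assumptions made for this example, and blocking if/while-statements are left out. 

```python
def project_action(action, r):
    """Project a global action onto role r (hypothetical tuple encoding)."""
    kind = action[0]
    if kind == "assign":                      # ("assign", q, y, e)
        return action if action[1] == r else ("tau",)
    if kind == "comm":                        # ("comm", p, e, q, y)
        _, p, e, q, y = action
        if r == p:
            return ("send", p, q, e)
        if r == q:
            return ("recv", p, q, y)
        return ("tau",)
    if kind == "decide":                      # ("decide", i, roles)
        _, i, roles = action
        return ("decide", i, r, roles) if r in roles else ("tau",)
    raise ValueError("unknown action: %r" % (action,))


def project(G, r):
    """Project a global program onto role r (non-blocking if only)."""
    kind = G[0]
    if kind == "skip":
        return G
    if kind == "act":                         # ("act", action)
        return ("act", project_action(G[1], r))
    if kind in ("seq", "par"):                # (kind, G1, G2)
        return (kind, project(G[1], r), project(G[2], r))
    if kind == "nif":                         # ("nif", conds, G1, R1, G2, R2)
        _, conds, G1, R1, G2, R2 = G
        R = frozenset(conds)
        cond = conds.get(r, "true")           # own conjunct, or true
        # Roles other than r that have not yet entered a branch.
        n = len(R - (R1 | R2 | {r}))
        return ("nif", cond, n,
                project(G1, r), R1 & {r},
                project(G2, r), R2 & {r})
    raise ValueError("unknown program: %r" % (G,))


G = ("nif", {"A": "A.x==5", "B": "B.y==6"},
     ("act", ("assign", "B", "y", "7")), frozenset({"A"}),
     ("skip",), frozenset())
local_B = project(G, "B")
local_C = project(G, "C")
```

In this example, role A has already entered the first branch, so the projection onto B records a delay counter of 0, whereas the projection onto an outside role C records that one role (B) has yet to decide. 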


7.2 Operational Equivalence 


Informally, our main theorem for local programs and projection is as follows: 
if the global program is well-formed, and if the computed precondition is true 
in the initial state, then operational equivalence is provided. In the rest of this 
section, we first present auxiliary lemmas; next, we present the main theorem. 

The first lemma pertains to soundness of ↾. It states that if G is well-formed 
and can terminate or reduce, then G ↾ r can similarly terminate or reduce. 


Lemma 7. 


1. If ✓R(G) and r ∈ R and G↓, then (G ↾ r)↓. 
2. If ✓R(G) and r ∈ R and G --γ--> G′, then (G ↾ r) --γ↾r--> (G′ ↾ r). 


Proof. By induction on the derivation of G↓ (item 1) and G --γ--> G′ (item 2). 
The interesting cases are rules [→-IF1], [→-IF2], [→-WHILE1], and [→-WHILE2]: 
in those cases, we use premises ✓R(G) and r ∈ R to establish that r must have 
its own conjunct in the multiparty condition, so it must contribute to γ. 


The second lemma pertains to completeness of ↾. It states that if G is well- 
formed, and if G ↾ r can terminate, then G can similarly terminate. Furthermore, 
it states that if G is well-formed, and if every G ↾ r can reduce by executing γ ↾ r, 
for every subject r of γ, then G can similarly reduce. 


Lemma 8. 


1. If ✓R(G) and (G ↾ r)↓, then G↓. 
2. If ✓R(G) and (G ↾ r) --λ_r--> L_r, for every r ∈ subj(γ), then G --γ--> G′ and 
λ_r = γ ↾ r and L_r = G′ ↾ r, for every r, for some γ, G′. 


Proof. By induction on the derivation of (G ↾ r)↓ (item 1) and the derivations 
of (G ↾ r) --λ_r--> L_r, for every r ∈ subj(γ) (item 2). The interesting cases are 
[→-PAR1] and [→-PAR2]: we use premise ✓R(G) to establish that either the LHS 
is reduced in every G ↾ r, or the RHS (otherwise, there is no unique G′). 

Thus, the previous lemmas show that a global program and its family of projec- 
tions can simulate each other’s behaviour, at the abstract “top layer” of the oper- 
ational semantics. The following theorem shows that this result can be extended 
to the concrete “bottom layer”: it states that if G is well-formed, and if b(G, x) is 
true in S, then (G, S) and ({G ↾ r}r∈R, S) are weakly bisimilar (e.g., [30]), denoted 
with ≈. This means that (G, S) and ({G ↾ r}r∈R, S) can coinductively simulate 
each other’s behaviour, modulo delays (i.e., operational equivalence). 


Theorem 6. If ✓R(G) and S ∈ [b(G, x)], then (G, S) ≈ ({G ↾ r}r∈R, S). 


Proof. We prove the theorem using Lems. 7-8 and Fig. 5. See Appx. C [39] for a 
more detailed overview of the steps, including a weak bisimulation relation. 


8 Conclusion 


We presented a new theory of choreographic programming. It supports for the 
first time: construction of distributed systems that require decentralised de- 
cision making; analysis of distributed systems to provide not only deadlock 
freedom but also functional correctness. Both contributions are enabled by a 
single new technique, namely a predicate transformer for choreographies. 

The following corollary summarises our main theorems (Thms. 3-6): 


Corollary 1. If global program G (with multiparty conditions in if/while-state- 
ments) is well-formed, and if precondition @(G, x) is true in initial state S, then 
the family of projections ({G ↾ r}r∈R, S) is deadlock-free and functionally correct. 


For instance, in Sect. 2, we presented a deadlock-free global program for leader 
election; in Appx. E [39], we demonstrate how to prove its functional correctness; 
by Cor. 1, these properties are preserved by projection. 

We implemented the new theory on top of the existing VerCors tool for 
deductive verification [4]; we present this implementation elsewhere. 

In future work, we aim to extend the new theory with: (1) asynchronous 
communication; (2) a new version of merging [8,9,10] for decentralised decision 
making (see also footnote 6); (3) more flexible interleaving by relaxing the dis- 
jointness requirement for interleaving to support shared variables (e.g., using 
concurrent separation logic [6,44]). 


Acknowledgments Funded by the Netherlands Organisation of Scientific Re- 
search (NWO): 016.Veni.192.103. 


Abstract. This paper shows that the π-calculus with implicit matching 
is no more expressive than CCSγ, a variant of CCS in which the result of 
a synchronisation of two actions is itself an action subject to relabelling 
or restriction, rather than the silent action τ. This is done by exhibiting 
a compositional translation from the π-calculus with implicit matching 
to CCSγ that is valid up to strong barbed bisimilarity. 

The full π-calculus can be similarly expressed in CCSγ enriched with the 
triggering operation of MEIJE. 

I also show that these results cannot be recreated with CCS in the rôle 
of CCSγ, not even up to reduction equivalence, and not even for the 
asynchronous π-calculus without restriction or replication. 

Finally I observe that CCS cannot be encoded in the π-calculus. 


1 Introduction 


The π-calculus [23,24,22,33] has been advertised as an “extension to the process 
algebra CCS” [23] adding mobility. It is widely believed that the π-calculus has 
features that cannot be expressed in CCS, or other immobile process calculi (so 
called in [27]) such as ACP and CSP. 


“the π-calculus has a much greater expressiveness than CCS” 
[Sangiorgi [32]] 
“Mobility — of whatever kind — is important in modern computing. 
It was not present in CCS or CSP, [...] but [...] the π-calculus [...] 
takes mobility of linkage as a primitive notion.” [Milner [22]] 


The present paper investigates this belief by formally comparing the expressive 
power of the π-calculus and immobile process calculi. 

Following [10,11] I define one process calculus to be at least as expressive as 
another up to a semantic equivalence ~ iff there exists a so-called valid trans- 
lation up to ~ from the other to the one. Validity entails compositionality, and 
requires that each translated expression is ~-equivalent to its original. This con- 
cept is parametrised by the choice of a semantic equivalence that is meaningful 
for both the source and the target language. Any language is as expressive as 
any other up to the universal relation, whereas almost no two languages are 
equally expressive up to the identity relation. The equivalence ~ up to which a 
translation is valid is a measure for the quality of the translation, and thereby 
for the degree in which the source language can be expressed in the target. 

Robert de Simone [34] showed that a wide class of process calculi, including 
CCS [20], CSP [6], ACP [4] and SCCS [18], are expressible up to strong bisimi- 
larity in MEIJE [1]. In [8] I sharpened this result by eliminating the crucial rôle 
played by unguarded recursion in De Simone’s translation, now taking aprACP_R 
as the target language. Here aprACP_R is a fragment of the language ACP of [4], 
enriched with relational relabelling, and using action prefixing instead of general 
sequential composition. It differs from CCS only in its more versatile communica- 
tion format, allowing multiway synchronisation instead of merely handshaking, 
in the absence of a special action τ, and in the relational nature of the relabelling 
operator. The class of languages that can be translated to MEIJE and aprACP_R 
are the ones whose structural operational semantics fits a format due to [34], 
now known as the De Simone format. They can be considered the “immobile 
process calculi” alluded to above. The π-calculus does not fit into this class: its 
operational semantics is not in De Simone format. 

To compare the expressiveness of mobile and immobile process calculi I first 
of all need to select a suitable semantic equivalence that is meaningful for both 
kinds of languages. A canonical choice is strong barbed bisimilarity [26,33]. Strong 
barbed bisimilarity is not a congruence for either CCS or the π-calculus, but it is 
used as a semantic basis for defining suitable congruences on languages [26,33]. 
For CCS, the familiar notion of strong bisimilarity [19] arises as the congruence 
closure of strong barbed bisimilarity. For the π-calculus, the congruence closure 
of strong barbed bisimilarity yields the notion of strong early congruence, called 
strong full bisimilarity in [33]. In general, whatever its characterisation in a par- 
ticular calculus, strong barbed congruence is the name of the congruence closure 
of strong barbed bisimilarity, and a default choice for a semantic equivalence 
[33]. 

My first research goal was to find out if there exists a translation from the 
π-calculus to CCS that is valid up to strong barbed bisimilarity. The answer 
is negative. In fact, no compositional translation of the π-calculus to CCS is 
possible, even when weakening the equivalence up to which it should be valid 
from strong barbed bisimilarity to strong reduction equivalence, and even when 
restricting the source language to the asynchronous π-calculus [5] without re- 
striction and replication. This disproves a result of [3]. 

My next research goal was to find out if there is a translation from the 
π-calculus to any other immobile process calculus, and if yes, to keep the target 
language as close as possible to CCS. Here the answer turned out to be positive. 
How close the target language can be kept to CCS depends on which version 
of the π-calculus I take as source language. My first choice was the original 
π-calculus, as presented in [23,24], as it is at least as expressive as its competitors. 
It turns out, however, that the matching operator [x=y]P of [23,24] is the source of 
a complication. The book [33] merely allows matching to occur as part of action 
prefixing, as in [x=y]u(z).P or [x=y]ūv.P. I call this implicit matching. Matching 
was introduced in [23,24] to facilitate complete equational axiomatisations of the 
π-calculus, and [33] shows that for that purpose implicit matching is sufficient. 

To obtain a valid translation from the π-calculus with implicit matching 
(henceforth called πIM) to an upgraded variant of CCS, the only upgrade needed 
is to turn the result of a synchronisation of two actions into a visible action, 
subject to relabelling or restriction, rather than the silent action τ. I call this 
variant CCSγ, where γ is a commutative partial binary communication function, 
just like in ACP [4]. CCSγ is a fragment of aprACP_R, which also carries a 
parameter γ. If γ(a,b) = c, this means that an a-action of one component in 
a parallel composition may synchronise with a b-action of another component, 
into a c-action; if γ(a,b) is undefined, the actions a and b do not synchronise. 
CCS can be seen as the instance of CCSγ with γ(ā,a) = τ, and γ undefined 
for other pairs of actions. But as target language for my translation I will need 
another choice of the parameter γ. 

An important feature of ACP, which greatly contributes to its expressiveness, 
is multiway synchronisation. This is achieved by allowing an action γ(a,b) to 
synchronise with an action c into γ(γ(a,b),c). This feature is not needed for 
the target language of my translations. So I require that γ(γ(a,b),c) is always 
undefined. 

To obtain a valid translation from the full π-calculus, with an explicit match- 
ing operator, I need to further upgrade CCSγ with the triggering operator of 
MEIJE, which allows a relabelling of the first action of its argument only. 

By a general result of [11], the validity up to strong barbed bisimilarity of 
my translation from πIM to CCSγ (and from π to CCSγ^trig) implies that it is 
even valid up to an equivalence on their disjoint union that on π coincides with 
strong barbed congruence, or strong early congruence, and on CCSγ^trig is the 
congruence closure of strong barbed bisimilarity under translated contexts. The 
latter is strictly coarser than strong bisimilarity, which is the congruence closure 
of strong barbed bisimilarity under all CCSγ^trig contexts. 

Having established that πIM can be expressed in CCSγ, the possibility re- 
mains that the two languages are equally expressive. This, however, is not the 
case. There does not exist a valid translation (up to any reasonable equivalence) 
from CCS, and thus neither from CCSγ, to the π-calculus, even when disallowing 
the infinite sum of CCS, as well as unguarded recursion. This is a trivial conse- 
quence of the power of the CCS renaming operator, which cannot be mimicked 
in the π-calculus. Using a simple renaming operator that is as finite as the suc- 
cessor function on the natural numbers, CCS, even without infinite sum and 
unguarded recursion, allows the specification of a process with infinitely many 
weak barbs, whereas this is fundamentally impossible in the π-calculus. 


2 CCS 


CCS [19] is parametrised with sets 𝒦 of agent identifiers and 𝒜 of visible 
actions. The set 𝒜̄ of co-actions is 𝒜̄ := {ā | a ∈ 𝒜}, and ℒ := 𝒜 ∪ 𝒜̄ is 
the set of labels. The overline function is extended to ℒ by declaring that the 
co-action of ā is a. Finally, Act := ℒ ⊎ {τ} is the set of actions. Below, a, b, c, ... 
range over ℒ and α, β over Act. A relabelling is a function f: ℒ → ℒ satisfying 
f(ā) = \overline{f(a)}; it extends to Act by f(τ) := τ. 

Table 1. Structural operational semantics of CCS 

  α.P --α--> P 
  P_j --α--> P′                 ⟹  Σ_{i∈I} P_i --α--> P′    (j ∈ I) 
  P --α--> P′                   ⟹  P|Q --α--> P′|Q 
  P --a--> P′,  Q --ā--> Q′     ⟹  P|Q --τ--> P′|Q′ 
  Q --α--> Q′                   ⟹  P|Q --α--> P|Q′ 
  P --α--> P′                   ⟹  P\L --α--> P′\L          (α ∉ L ∪ L̄) 
  P --α--> P′                   ⟹  P[f] --f(α)--> P′[f] 
  P --α--> P′                   ⟹  A --α--> P′              (A ≝ P) 

The class T_CCS of CCS terms, expressions, processes or 
agents is the smallest class¹ including: 


α.P          for α ∈ Act and P ∈ T_CCS            prefixing 
Σ_{i∈I} P_i  for I an index set and P_i ∈ T_CCS   choice 
P|Q          for P,Q ∈ T_CCS                      parallel composition 
P\L          for L ⊆ ℒ and P ∈ T_CCS              restriction 
P[f]         for f a relabelling and P ∈ T_CCS    relabelling 
A            for A ∈ 𝒦                            recursion. 


One writes P1 + P2 for Σ_{i∈I} P_i when I = {1,2}, and 0 when I = ∅. Each agent 
identifier A ∈ 𝒦 comes with a unique defining equation of the form A ≝ P with 
P ∈ T_CCS. The semantics of CCS is given by the labelled transition relation 
→ ⊆ T_CCS × Act × T_CCS. The transitions P --α--> Q with P,Q ∈ T_CCS and α ∈ Act 
are derived from the rules of Table 1. 

Arguably, the most authentic version of CCS [20] features a recursion con- 
struct instead of agent identifiers. Since there exists a straightforward valid 
translation from the version of CCS presented here to the one from [20], the latter 
is at least as expressive. Therefore, when showing that a variant of CCS is at 
least as expressive as the π-calculus, I obtain a stronger result by using agent 
identifiers. 


3 CCSγ 


CCSγ has four parameters: the same set 𝒦 of agent identifiers as for CCS, an 
alphabet 𝒜 of visible actions, with a subset 𝒮 ⊆ 𝒜 of synchronisations², and a 
partial communication function γ : (𝒜\𝒮)² ⇀ 𝒮 ∪ {τ}, which is commutative, 
i.e. γ(a,b) = γ(b,a) and each side of this equation is defined just when the other 
side is. Compared to CCS there are no co-actions, so Act := 𝒜 ⊎ {τ}. 


¹ CCS [19,20] allows arbitrary index sets I in summations Σ_{i∈I} P_i. As a consequence, 
T_CCS is a proper class rather than a set. Although this is unproblematic, many 
computer scientists prefer the class of terms to be a set. This can be achieved by 
choosing a cardinal κ and requiring the index sets I to satisfy |I| < κ. To enable 
my translation from the π-calculus to CCSγ, κ should exceed the size of the set of 
names used in the π-calculus. 

² These have been added solely to prevent multiway synchronisation. 

The syntax of CCSγ is the same as that of CCS, except that parallel compo- 
sition is denoted ∥ rather than |, following ACP [4,2]. This indicates a semantic 
difference: the rule for communication in the middle of Table 1 is for CCSγ 
replaced by 

  P --a--> P′,  Q --b--> Q′  ⟹  P∥Q --c--> P′∥Q′    (γ(a,b) = c). 

Moreover, relabelling operators f : 𝒜 → Act are allowed to rename visible 
actions into τ, but not vice versa.³ They are required to satisfy c ∈ 𝒮 ⇒ f(c) ∈ 
𝒮 ∪ {τ}. These are the only differences between CCS and CCSγ. 
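
As a small illustration of the communication function, the following Python sketch represents γ as a finite lookup table. The helper names make_gamma, synchronise and ccs_gamma, the string encoding of actions, and the use of "~a" for the co-action of a are assumptions made for this example only. 

```python
TAU = "tau"


def make_gamma(pairs):
    """Build a commutative partial communication function from a dict
    {(a, b): c}. Rejects tables in which a synchronisation result can
    synchronise again, which would amount to multiway synchronisation."""
    table = {}
    for (a, b), c in pairs.items():
        table[(a, b)] = c
        table[(b, a)] = c          # commutativity: γ(a,b) = γ(b,a)
    results = {c for c in table.values() if c != TAU}
    arguments = {x for pair in table for x in pair}
    if results & arguments:
        raise ValueError("a synchronisation result may not synchronise again")
    return table


def synchronise(gamma, a, b):
    """Return γ(a,b), or None when a and b do not synchronise."""
    return gamma.get((a, b))


def ccs_gamma(names):
    """The CCS-like instance: γ(~a, a) = τ, undefined otherwise."""
    return make_gamma({("~" + a, a): TAU for a in names})


g = make_gamma({("a", "b"): "c"})
ccs = ccs_gamma(["a", "b"])
```

Here `synchronise(g, "b", "a")` yields the visible action "c", while in the CCS-like instance a name and its co-name synchronise into τ and all other pairs are undefined. 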


4 Strong barbed bisimilarity 


The semantics of the π-calculus and CCS can be expressed by associating a 
labelled or a barbed transition system with these languages, with processes as 
states. Semantic equivalences are defined on the states of labelled or barbed 
transition systems, and thereby on π- and CCS processes. 


Definition 1. A labelled transition system (LTS) is a pair (S, →) with S a class 
(of states) and → ⊆ S × A × S a transition relation, for some suitable set of 
actions A. 


I write P --α--> Q for (P,α,Q) ∈ →, P --α--> for ∃Q. P --α--> Q, and P -/α/-> 
for its negation. The structural operational semantics of CCS presented before 
creates an LTS with as states all CCS processes and the transition relation 
derived from the operational rules, with A := Act. 


Definition 2. A strong bisimulation is a symmetric relation ℛ on the states of 
an LTS such that 


— if P ℛ Q and P --α--> P′ then ∃Q′. Q --α--> Q′ ∧ P′ ℛ Q′. 


Processes P and Q are strongly bisimilar (notation P ↔ Q) if P ℛ Q for some 
strong bisimulation ℛ. 


As is well-known, ↔ is an equivalence relation, and a strong bisimulation itself. 
Through the operational semantics of CCSγ, strong bisimilarity is defined on 
CCSγ processes. 
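
For a finite LTS, strong bisimilarity can be computed as a greatest fixed point: start from the full relation and repeatedly discard pairs that violate the transfer condition of Definition 2. The Python sketch below is a naive illustration under that finiteness assumption; the function name strong_bisimilarity and the encoding of transitions as triples are hypothetical. 

```python
def strong_bisimilarity(states, transitions):
    """Return the largest strong bisimulation on a finite LTS.

    states: iterable of state names
    transitions: set of triples (p, a, q) meaning p --a--> q
    """
    succ = {}
    for p, a, q in transitions:
        succ.setdefault(p, set()).add((a, q))

    # Start from the full (symmetric) relation and refine it.
    rel = {(p, q) for p in states for q in states}
    changed = True
    while changed:
        changed = False
        for (p, q) in list(rel):
            if (p, q) not in rel:
                continue  # already discarded in this round
            # Every move of p must be matched by an equally labelled move of q.
            ok = all(
                any(b == a and (p2, q2) in rel for (b, q2) in succ.get(q, ()))
                for (a, p2) in succ.get(p, ())
            )
            if not ok:
                # Discard the pair and its mirror image to keep rel symmetric.
                rel.discard((p, q))
                rel.discard((q, p))
                changed = True
    return rel


# P0 = a.(b + c),  Q0 = a.b + a.c,  R0 = a.(b + c) built from fresh states.
states = ["P0", "P1", "Q0", "Q1", "Q2", "R0", "R1", "0"]
transitions = {
    ("P0", "a", "P1"), ("P1", "b", "0"), ("P1", "c", "0"),
    ("Q0", "a", "Q1"), ("Q0", "a", "Q2"), ("Q1", "b", "0"), ("Q2", "c", "0"),
    ("R0", "a", "R1"), ("R1", "b", "0"), ("R1", "c", "0"),
}
bisim = strong_bisimilarity(states, transitions)
```

The example separates a.(b + c) from a.b + a.c, the classical pair of processes that are trace equivalent but not strongly bisimilar. 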


Definition 3. A barbed transition system (BTS) is a triple (S, ⟼, ↓) with S a 
class (of states), ⟼ ⊆ S × S a reduction relation, and ↓ ⊆ S × B an observability 
predicate for some suitable set of barbs B. 


³ Renaming into τ could already be done in CCS by means of parallel composition. 
Hence this feature in itself does not add extra expressiveness. 

Table 2. The actions 


α      | Kind         | O(α) | fn(α)           | bn(α) 
Mτ     | Silent       | -    | n(M)            | ∅ 
Mx̄y    | Free output  | x̄    | n(M) ∪ {x, y}   | ∅ 
Mx̄(y)  | Bound output | x̄    | n(M) ∪ {x}      | {y} 
Mxy    | Free input   | x    | n(M) ∪ {x, y}   | ∅ 
Mx(y)  | Bound input  | x    | n(M) ∪ {x}      | {y} 


One writes P↓b for P ∈ S and b ∈ B when (P,b) ∈ ↓. A BTS can be extracted 
from an LTS with τ ∈ A, by means of a partial observation function O: A ⇀ B. 
The states remain the same, the reductions are taken to be the transitions la- 
belled τ (dropping the label in the BTS), and P↓b holds exactly when there is a 
transition P --α--> Q with O(α) = b. 
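
The extraction of a BTS from an LTS can be sketched as follows. This is a minimal illustration: actions are encoded as hypothetical tuples (kind, subject, object) with an empty matching sequence, "~x" stands for the output barb on x, and the names observe and extract_bts are assumptions made for this example. 

```python
def observe(action, public):
    """Partial observation function O: return a barb, or None if undefined.

    Only the subject of an input or output is observed, and only when it
    is a public name; silent actions have no barb.
    """
    kind = action[0]
    if kind == "tau":
        return None
    subject = action[1]
    if subject not in public:
        return None
    return "~" + subject if kind in ("out", "bout") else subject


def extract_bts(transitions, public):
    """Turn LTS transitions (p, action, q) into (reductions, barbs)."""
    # Reductions are the tau-labelled transitions, with the label dropped.
    reductions = {(p, q) for (p, a, q) in transitions if a[0] == "tau"}
    barbs = set()
    for (p, a, q) in transitions:
        b = observe(a, public)
        if b is not None:
            barbs.add((p, b))
    return reductions, barbs


lts = {
    ("P", ("out", "x", "y"), "P1"),
    ("P", ("tau",), "P2"),
    ("P2", ("in", "k", "z"), "P3"),
}
reductions, barbs = extract_bts(lts, public={"x"})
```

In this example the input on k produces no barb, because k is not among the public names. 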

In this paper I consider labelled transition systems whose actions α ∈ A are of 
the forms presented in Table 2. Here x and y are names, drawn from the disjoint 
union of two sets 𝒵 and ℛ of public and private names, and M is a (possibly 
empty) matching sequence, a sequence of matches [x=y] with x,y ∈ 𝒵 ⊎ ℛ and 
x ≠ y. The set of names occurring in M is denoted n(M). In Table 2, also the 
free names fn(α) and bound names bn(α) of an action α are defined. The set 
of names of α is n(α) := fn(α) ∪ bn(α). Consequently, also the actions Act of 
my instantiation of CCSγ need to have the forms of Table 2. For the translation 
into barbed transition systems I take B := 𝒵 ∪ 𝒵̄, where 𝒵̄ := {x̄ | x ∈ 𝒵}, and 
O(α) as indicated in Table 2, provided M = ε and O(α) ∈ B. 


Definition 4. A strong barbed bisimulation is a symmetric relation ℛ on the 
states of a BTS such that 


— if P ℛ Q and P ⟼ P′ then ∃Q′. Q ⟼ Q′ ∧ P′ ℛ Q′ 
— and if P ℛ Q and P↓b then also Q↓b. 


Processes P and Q are strongly barbed bisimilar (notation P ∼• Q) if P ℛ Q 
for some strong barbed bisimulation ℛ. 


Again, ∼• is an equivalence relation, and a strong barbed bisimulation itself. 
Through the above definition, strong barbed bisimilarity is defined on all LTSs 
occurring in this paper, as well as on my instantiation of CCSγ. It can also be 
used to compare processes from different LTSs, namely by taking their disjoint 
union. 


5 The π-calculus 


The π-calculus [23,24] is parametrised with an infinite set 𝒩 of names and, for 
each n ∈ ℕ, a set 𝒦_n of agent identifiers of arity n. The set T_π of π-calculus 
terms, expressions, processes or agents is the smallest set including: 

0                                                inaction 
τ.P          for P ∈ T_π                         silent prefix 
x̄y.P         for x,y ∈ 𝒩 and P ∈ T_π             output prefix 
x(y).P       for x,y ∈ 𝒩 and P ∈ T_π             input prefix 
(νy)P        for y ∈ 𝒩 and P ∈ T_π               restriction 
[x=y]P       for x,y ∈ 𝒩 and P ∈ T_π             match 
P|Q          for P,Q ∈ T_π                       parallel composition 
P+Q          for P,Q ∈ T_π                       choice 
A(y1,...,yn) for A ∈ 𝒦_n and y_i ∈ 𝒩             defined agent 


The order of precedence among the operators is the order of the listing above. 
A process α.0 with α = τ or x̄y or x(y) is often written α. 

n(P) denotes the set of all names occurring in a process P. An occurrence 
of a name y in a term is bound if it occurs in a subterm of the form x(y).P or 
(νy)P; otherwise it is free. The set of names occurring free (resp. bound) in a 
process P is denoted fn(P) (resp. bn(P)). 
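
The sets fn(P) and bn(P) can be computed by structural recursion. The Python sketch below uses a hypothetical tuple encoding of π-calculus terms (tags "nil", "tau", "out", "in", "nu", "match", "par", "sum", "call"); the encoding and the function names fn and bn are assumptions made for this illustration. 

```python
def fn(P):
    """Free names of a process in the hypothetical tuple encoding."""
    kind = P[0]
    if kind == "nil":
        return set()
    if kind == "tau":                  # ("tau", P)
        return fn(P[1])
    if kind == "out":                  # ("out", x, y, P) encodes x̄y.P
        return {P[1], P[2]} | fn(P[3])
    if kind == "in":                   # ("in", x, y, P) encodes x(y).P
        return {P[1]} | (fn(P[3]) - {P[2]})
    if kind == "nu":                   # ("nu", y, P) encodes (νy)P
        return fn(P[2]) - {P[1]}
    if kind == "match":                # ("match", x, y, P) encodes [x=y]P
        return {P[1], P[2]} | fn(P[3])
    if kind in ("par", "sum"):         # (kind, P, Q)
        return fn(P[1]) | fn(P[2])
    if kind == "call":                 # ("call", A, (y1, ..., yn))
        return set(P[2])
    raise ValueError("unknown term: %r" % (P,))


def bn(P):
    """Bound names: those bound by an input prefix or a restriction."""
    kind = P[0]
    if kind in ("nil", "call"):
        return set()
    if kind == "tau":
        return bn(P[1])
    if kind in ("out", "match"):
        return bn(P[3])
    if kind == "in":
        return {P[2]} | bn(P[3])
    if kind == "nu":
        return {P[1]} | bn(P[2])
    if kind in ("par", "sum"):
        return bn(P[1]) | bn(P[2])
    raise ValueError("unknown term: %r" % (P,))


NIL = ("nil",)
# x(y).ȳz.0 | (νz) z̄x.0
example = ("par",
           ("in", "x", "y", ("out", "y", "z", NIL)),
           ("nu", "z", ("out", "z", "x", NIL)))
```

Note that a name may be both free and bound in the same process, as z is in this example. 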

Each agent identifier A ∈ 𝒦_n is assumed to come with a unique defining 
equation of the form 
A(x1,...,xn) ≝ P 
where the names x_i are all distinct and fn(P) ⊆ {x1,...,xn}. 

The π-calculus with implicit matching (πIM) drops the matching operator, 
instead allowing prefixes of the form Mx̄y.P, Mx(y).P and Mτ.P, with M a 
matching sequence. 

A substitution is a partial function σ: 𝒩 ⇀ 𝒩 such that 𝒩\(dom(σ) ∪ range(σ)) 
is infinite. For x⃗ = (x1,...,xn), y⃗ = (y1,...,yn) ∈ 𝒩ⁿ, {y⃗/x⃗} denotes the substi- 
tution given by σ(x_i) = y_i for 1 ≤ i ≤ n. One writes {y/x} when n=1. 

For x ∈ 𝒩, x[σ] denotes σ(x) if x ∈ dom(σ) and x otherwise; M[σ] is the 
result of changing each occurrence of a name x in M into x[σ], while dropping 
resulting matches [y=y]. 

For a substitution σ, the process Pσ is obtained from P ∈ T_π by simultaneous 
substitution, for all x ∈ dom(σ), of x[σ] for all free occurrences of x in P, with 
change of bound names to avoid name capture. A formal inductive definition is: 


0σ          = 0 
(Mτ.P)σ     = M[σ]τ.(Pσ) 
(Mx̄y.P)σ    = M[σ]\overline{x[σ]}y[σ].(Pσ) 
(Mx(y).P)σ  = M[σ]x[σ](z).(P{z/y}σ) 
((νy)P)σ    = (νz)(P{z/y}σ) 
([x=y]P)σ   = [x[σ]=y[σ]](Pσ) 
(P|Q)σ      = (Pσ)|(Qσ) 
(P+Q)σ      = (Pσ) + (Qσ) 
A(y⃗)σ       = A(y⃗[σ]) 


where z is chosen outside fn((νy)P) ∪ dom(σ) ∪ range(σ); in case y ∉ dom(σ) ∪ 
range(σ) one always picks z := y. 
A congruence is an equivalence relation ~ on T_π such that P ~ Q implies 
τ.P ~ τ.Q, x̄y.P ~ x̄y.Q, x(y).P ~ x(y).Q, (νy)P ~ (νy)Q, [x=y]P ~ [x=y]Q, 
P|U ~ Q|U, U|P ~ U|Q, P+U ~ Q+U and U+P ~ U+Q. Let ≡ be the 
smallest congruence on T_π allowing renaming of bound names, i.e., that satisfies 
x(y).P ≡ x(z).(P{z/y}) and (νy)P ≡ (νz)(P{z/y}) for any z ∉ fn((νy)P). If 
P ≡ Q, then Q is obtained from P by means of α-conversion. Due to the choice 
of z above, substitution is precisely defined only up to α-conversion. 

Note that P ≡ Q implies that fn(P) = fn(Q), and also that Pσ ≡ Qσ for 
any substitution σ. 


6 The semantics of the π-calculus 


Fig. 1. Semantics of the π-calculus 


Whereas CCS has only one operational semantics, the π-calculus is equipped 
with at least five, as indicated in Figure 1. The late operational semantics stems 
from [24], the origin of the π-calculus. It is given by the action rules of Table 3. 
These rules generate a labelled transition system in which the states are the 
π-calculus processes and the transitions are labelled with the actions τ, x̄y, x̄(y) 
and x(y) of Table 2 (always with M the empty string). Here I take 𝒵 := 𝒩 and 
ℛ := ∅. For πIM, rule match is omitted. A process [x=y]α.P has no outgoing 
transitions, similar to 0. 

In [24] the late and early bisimulation semantics of the π-calculus were pro- 
posed. 


Definition 5. A late bisimulation is a symmetric relation ℛ on π-processes 
such that, whenever P ℛ Q, α is either τ or x̄y and z ∉ n(P) ∪ n(Q), 


1. if P --α--> P′ then ∃Q′ with Q --α--> Q′ and P′ ℛ Q′, 
2. if P --x(z)--> P′ then ∃Q′ ∀y. Q --x(z)--> Q′ ∧ P′{y/z} ℛ Q′{y/z}, 
3. if P --x̄(z)--> P′ then ∃Q′ with Q --x̄(z)--> Q′ and P′ ℛ Q′. 


Processes P and Q are late bisimilar (notation P ∼̇_L Q) if P ℛ Q for some late 
bisimulation ℛ. They are late congruent (notation P ∼_L Q) if P{y⃗/x⃗} ∼̇_L Q{y⃗/x⃗} 
for any substitution {y⃗/x⃗}. 
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Table 3. Late structural operational semantics of the 7-calculus 


tau: output: input: 
P3 P zy.P 24 P 2(y).P 22s Play} (zg fn((vy)P)) 
sum: match: ide: 

a / = i yjz = : 
ae ar PUY OP (aia) P) 
Pag SP [z=2]P > P’ A) > P' 
par: com: close: 

PpP o P24 pP, Q223Q' PO P, QQ 
PIQ > PQ fn(Q) = 90 PIQ > P\Q'{y/z} P|Q => (vz)(P'|Q’) 
res: alpha-open: 

Pp p P *% P' z 
are ae (y g n(a)) alz) Ae B 
(vy) P © (vy) P' (vy) P 22 Pizy} f 


The rules sum, par, com and close additionally have symmetric forms, 
with the rôles of P and Q exchanged. 


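Early bisimilarity ($\stackrel{\bullet}{\sim}_E$) and congruence ($\sim_E$) are defined likewise, but with
$\forall y\,\exists Q'$ instead of $\exists Q'\,\forall y$. In [24,33] it is shown that $\stackrel{\bullet}{\sim}_L$ and $\stackrel{\bullet}{\sim}_E$ are congruences
for all operators of the π-calculus, except for the input prefix. $\sim_L$ and $\sim_E$ are
congruence relations for the entire language; in fact they are the congruence
closures of $\stackrel{\bullet}{\sim}_L$ and $\stackrel{\bullet}{\sim}_E$, respectively. By definition, $\stackrel{\bullet}{\sim}_L \subseteq \stackrel{\bullet}{\sim}_E$, and thus $\sim_L \subseteq \sim_E$.


Lemma 1 ([24]). Let $P \equiv Q$ and $\mathrm{bn}(\alpha) \cap \mathrm{n}(Q) = \emptyset$.
If $P \xrightarrow{\alpha} P'$ then $Q \xrightarrow{\alpha} Q'$ for some $Q'$ with $P' \equiv Q'$.


This implies that $\equiv$ is a late bisimulation, so that $\equiv\ \subseteq\ \stackrel{\bullet}{\sim}_L$.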

In [25] the early operational semantics of the π-calculus is proposed, presented
in Table 4; it uses free input actions $xy$ instead of bound inputs $x(y)$. This is also
the semantics of [33]. The semantics in [25,33] requires us to identify processes
modulo α-conversion before applying the operational rules. This is equivalent to
adding rule alpha of Table 4.

A variant of the late operational semantics incorporating rule alpha is also
possible. In this setting rule alpha-open can be simplified to open, and likewise
input to $x(y).P \xrightarrow{x(y)} P$. By Lemma 1, the late operational semantics with alpha
gives rise to the same notions of early and late bisimilarity as the late operational
semantics without alpha; the addition of this rule is entirely optional.
Interestingly, the rule alpha is not optional in the early operational semantics,
not even when reinstating alpha-open.


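Example 1. Let $P := \bar{x}y \mid (\nu y)(x(z))$. One has $(\nu y)(x(z)) \xrightarrow{x(z)}_L (\nu y)0$ and thus
$P \xrightarrow{\tau}_L 0 \mid (\nu y)0$ by com. However, $(\nu y)(x(z)) \xrightarrow{xy}_E (\nu y)0$ is forbidden by the side
condition of res, so in the early semantics without alpha $P$ cannot make a $\tau$-step.
Rule alpha comes to the rescue here, as it allows $P \equiv \bar{x}y \mid (\nu w)(x(z)) \xrightarrow{\tau}_E 0 \mid (\nu w)0$.


Table 4. Early structural operational semantics of the π-calculus

tau: $\tau.P \xrightarrow{\tau} P$
output: $\bar{x}y.P \xrightarrow{\bar{x}y} P$
early-input: $x(y).P \xrightarrow{xz} P\{z/y\}$
sum: $\dfrac{P \xrightarrow{\alpha} P'}{P+Q \xrightarrow{\alpha} P'}$
match: $\dfrac{P \xrightarrow{\alpha} P'}{[x{=}x]P \xrightarrow{\alpha} P'}$
ide: $\dfrac{P\{\vec{y}/\vec{x}\} \xrightarrow{\alpha} P'}{A(\vec{y}) \xrightarrow{\alpha} P'}$   $(A(\vec{x}) \stackrel{\rm def}{=} P)$
par: $\dfrac{P \xrightarrow{\alpha} P'}{P|Q \xrightarrow{\alpha} P'|Q}$   $(\mathrm{bn}(\alpha) \cap \mathrm{fn}(Q) = \emptyset)$
early-com: $\dfrac{P \xrightarrow{\bar{x}y} P',\ Q \xrightarrow{xy} Q'}{P|Q \xrightarrow{\tau} P'|Q'}$
early-close: $\dfrac{P \xrightarrow{\bar{x}(z)} P',\ Q \xrightarrow{xz} Q'}{P|Q \xrightarrow{\tau} (\nu z)(P'|Q')}$   $(z \notin \mathrm{fn}(Q))$
res: $\dfrac{P \xrightarrow{\alpha} P'}{(\nu y)P \xrightarrow{\alpha} (\nu y)P'}$   $(y \notin \mathrm{n}(\alpha))$
open: $\dfrac{P \xrightarrow{\bar{x}y} P'}{(\nu y)P \xrightarrow{\bar{x}(y)} P'}$   $(y \neq x)$
alpha: $\dfrac{P \equiv Q,\ Q \xrightarrow{\alpha} Q'}{P \xrightarrow{\alpha} Q'}$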


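By the following lemma, the early transition relation $\rightarrow_E$ is completely determined
by the late transition relation $\rightarrow_{\alpha L}$ with alpha:


Lemma 2 ([25]). Let $P \in \mathrm{T}_\pi$ and let $\beta$ be $\tau$, $\bar{x}y$ or $\bar{x}(y)$.
- $P \xrightarrow{\beta}_E Q$ iff $P \xrightarrow{\beta}_{\alpha L} Q$.
- $P \xrightarrow{xy}_E Q$ iff $P \xrightarrow{x(z)}_{\alpha L} R$ for some $R$, $z$ with $Q = R\{y/z\}$.

The early transition relations allow a more concise definition of early bisimilarity: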


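Proposition 1 ([25]). An early bisimulation is a symmetric relation $\mathscr{R}$ on $\mathrm{T}_\pi$
such that, whenever $P \mathrel{\mathscr{R}} Q$ and $\alpha$ is an action with $\mathrm{bn}(\alpha) \cap (\mathrm{n}(P) \cup \mathrm{n}(Q)) = \emptyset$,

- if $P \xrightarrow{\alpha}_E P'$ then $\exists Q'$ with $Q \xrightarrow{\alpha}_E Q'$ and $P' \mathrel{\mathscr{R}} Q'$.

Processes $P$ and $Q$ are early bisimilar if $P \mathrel{\mathscr{R}} Q$ for some early bisimulation $\mathscr{R}$.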
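
The single transfer clause of Proposition 1 has the shape of ordinary strong bisimulation on a labelled transition system. The following Python sketch computes the largest such relation on a finite LTS by iterated refinement. It is an illustration under strong assumptions: the LTS is finite and given explicitly as triples, and the side condition on bound names is ignored, so it does not decide early bisimilarity of π-calculus processes in general. The function name `bisimilar` is hypothetical.

```python
def bisimilar(states, trans, p, q):
    """Decide strong bisimilarity of p and q on a finite LTS.

    trans is a set of (source, label, target) triples.
    """
    succ = {}
    for s, a, t in trans:
        succ.setdefault(s, set()).add((a, t))

    # Start from the full relation and remove pairs violating the transfer clause.
    rel = {(s, t) for s in states for t in states}
    changed = True
    while changed:
        changed = False
        for (s, t) in list(rel):
            forth = all(
                any(b == a and (s2, t2) in rel for (b, t2) in succ.get(t, ()))
                for (a, s2) in succ.get(s, ())
            )
            back = all(
                any(b == a and (s2, t2) in rel for (b, s2) in succ.get(s, ()))
                for (a, t2) in succ.get(t, ())
            )
            if not (forth and back):
                rel.discard((s, t))
                changed = True
    return (p, q) in rel
```

On the classic pair `a.(b + c)` versus `a.b + a.c`, encoded as a seven-state LTS, the check reports that the two initial states are not bisimilar.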


Through the general method of Section 4, taking $\mathcal{Z} := \mathcal{N}$ and $\mathcal{R} := \emptyset$,
a barbed transition system can be extracted from the late or early labelled
transition system of the π-calculus; by Lemmas 1 and 2 the same BTS is obtained
either way. This defines strong barbed bisimilarity $\stackrel{\bullet}{\sim}$ on $\mathrm{T}_\pi$. The congruence
closure of $\stackrel{\bullet}{\sim}$ is early congruence [33]. In [21] a reduction semantics of the
π-calculus is given that yields a BTS right away. Up to strong barbed bisimilarity,
this BTS is the same as the one extracted from the late or early LTS.

In [32] yet another operational semantics of the π-calculus was introduced, in
a style called symbolic by Hennessy & Lin [16], who had proposed it for a version
of value-passing CCS. It is presented in Table 5. The transitions are labelled with
actions $\alpha$ of the form $M\beta$, where $M$ is a matching sequence and $\beta$ an action as in
the late operational semantics. When $x \neq y$ the matching sequence $M$ prepended
with $[x{=}y]$ is denoted $[x{=}y]M$; however, $[x{=}x]M$ simply denotes $M$.

Table 5. Late symbolic structural operational semantics of the π-calculus

tau: $M\tau.P \xrightarrow{M\tau} P$
output: $M\bar{x}y.P \xrightarrow{M\bar{x}y} P$
input: $Mx(y).P \xrightarrow{Mx(z)} P\{z/y\}$   $(z \notin \mathrm{fn}((\nu y)P))$
sum: $\dfrac{P \xrightarrow{\alpha} P'}{P+Q \xrightarrow{\alpha} P'}$
symb-match: $\dfrac{P \xrightarrow{\alpha} P'}{[x{=}y]P \xrightarrow{[x=y]\alpha} P'}$
ide: $\dfrac{P\{\vec{y}/\vec{x}\} \xrightarrow{\alpha} P'}{A(\vec{y}) \xrightarrow{\alpha} P'}$   $(A(\vec{x}) \stackrel{\rm def}{=} P)$
par: $\dfrac{P \xrightarrow{\alpha} P'}{P|Q \xrightarrow{\alpha} P'|Q}$   $(\mathrm{bn}(\alpha) \cap \mathrm{fn}(Q) = \emptyset)$
symb-com: $\dfrac{P \xrightarrow{M\bar{x}y} P',\ Q \xrightarrow{Nv(z)} Q'}{P|Q \xrightarrow{[x=v]MN\tau} P'|Q'\{y/z\}}$
symb-close: $\dfrac{P \xrightarrow{M\bar{x}(z)} P',\ Q \xrightarrow{Nv(z)} Q'}{P|Q \xrightarrow{[x=v]MN\tau} (\nu z)(P'|Q')}$
res: $\dfrac{P \xrightarrow{\alpha} P'}{(\nu y)P \xrightarrow{\alpha} (\nu y)P'}$   $(y \notin \mathrm{n}(\alpha))$
symb-alpha-open: $\dfrac{P \xrightarrow{M\bar{x}y} P'}{(\nu y)P \xrightarrow{M\bar{x}(z)} P'\{z/y\}}$   $(y \neq x,\ z \notin \mathrm{fn}((\nu y)P'),\ y \notin \mathrm{n}(M))$

For the π-calculus, the blue $M$s (the matching sequences in front of the prefixes in
rules tau, output and input) are omitted; for $\pi_{IM}$ the purple rules.


In the operational semantics of CCS, $\tau$-actions can be thought of as reactions
that actually take place, whereas a transition labelled $a$ merely represents the
potential of a reaction with the environment, one that can take place only if
the environment offers a complementary transition $\bar{a}$. In case the environment
never does an $\bar{a}$, this potential will not be realised. A reduction semantics (as
in [22]) yields a BTS that only represents directly the realised actions (the
$\tau$-transitions or reductions) and reasons about the potential reactions by defining
the semantics of a system in terms of reductions that can happen when placing
the system in various contexts. An LTS, on the other hand, directly represents
transitions that could happen under some conditions only, annotated with the
conditions that enable them. For CCS, this annotation is the label $a$, saying that
the transition is conditional on an $\bar{a}$-signal from the environment. As a result
of this, semantic equivalences defined on labelled transition systems tend to be
congruences for most operators right away, and do not need much closure under
contexts.

Seen from this perspective, the operational semantics of the π-calculus of
Table 3 or 4 is a compromise between a pure reduction semantics and a pure
labelled transition system semantics. Input and output actions are explicitly
included to signal potential reactions that are realised in the presence of a suitable
communication partner, but actions whose occurrence is conditional on two
different names $x$ and $y$ denoting the same channel are entirely omitted, even
though any π-process can be placed in a context in which $x$ and $y$ will be identified.
As a consequence of this, the early and late bisimilarities need to be closed
under all possible substitutions or identifications of names before they turn into
early and late congruences. The operational semantics of Table 5 adds the conditional
transitions that were missing in Table 3, and hence can be seen as a
true labelled transition system semantics.

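In this paper I need the early symbolic operational semantics of the π-calculus,
presented in Table 6. Although new, it is the logical combination of
the early and the (late) symbolic semantics. Its transitions that are labelled
with actions having an empty matching sequence are exactly the transitions of
the early semantics, so the BTS extracted from this semantics is the same.

For $\pi_{IM}$, rule symb-match is omitted, but tau, output and input carry the
matching sequence $M$ (indicated in blue).


Table 6. Early symbolic structural operational semantics of the π-calculus

tau: $M\tau.P \xrightarrow{M\tau} P$
output: $M\bar{x}y.P \xrightarrow{M\bar{x}y} P$
early-input: $Mx(y).P \xrightarrow{Mxz} P\{z/y\}$
sum: $\dfrac{P \xrightarrow{\alpha} P'}{P+Q \xrightarrow{\alpha} P'}$
symb-match: $\dfrac{P \xrightarrow{\alpha} P'}{[x{=}y]P \xrightarrow{[x=y]\alpha} P'}$
ide: $\dfrac{P\{\vec{y}/\vec{x}\} \xrightarrow{\alpha} P'}{A(\vec{y}) \xrightarrow{\alpha} P'}$   $(A(\vec{x}) \stackrel{\rm def}{=} P)$
par: $\dfrac{P \xrightarrow{\alpha} P'}{P|Q \xrightarrow{\alpha} P'|Q}$   $(\mathrm{bn}(\alpha) \cap \mathrm{fn}(Q) = \emptyset)$
e-s-com: $\dfrac{P \xrightarrow{M\bar{x}y} P',\ Q \xrightarrow{Nvy} Q'}{P|Q \xrightarrow{[x=v]MN\tau} P'|Q'}$
e-s-close: $\dfrac{P \xrightarrow{M\bar{x}(z)} P',\ Q \xrightarrow{Nvz} Q'}{P|Q \xrightarrow{[x=v]MN\tau} (\nu z)(P'|Q')}$   $(z \notin \mathrm{fn}(Q))$
res: $\dfrac{P \xrightarrow{\alpha} P'}{(\nu y)P \xrightarrow{\alpha} (\nu y)P'}$   $(y \notin \mathrm{n}(\alpha))$
symb-open: $\dfrac{P \xrightarrow{M\bar{x}y} P'}{(\nu y)P \xrightarrow{M\bar{x}(y)} P'}$   $(y \neq x,\ y \notin \mathrm{n}(M))$
alpha: $\dfrac{P \equiv Q,\ Q \xrightarrow{\alpha} Q'}{P \xrightarrow{\alpha} Q'}$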


7 Valid translations 


A signature $\Sigma$ is a set of operator symbols $g$, each of which is equipped with an
arity $n \in \mathbb{N}$. The set $\mathrm{T}_\Sigma$ of closed terms over $\Sigma$ is the smallest set such that,
for all $g \in \Sigma$,

$$P_1,\dots,P_n \in \mathrm{T}_\Sigma \;\Rightarrow\; g(P_1,\dots,P_n) \in \mathrm{T}_\Sigma.$$


Call a language simple if its expressions are the closed terms $\mathrm{T}_\Sigma$ over some
signature $\Sigma$. The π-calculus is simple in this sense; its signature consists of the
binary operators $+$ and $|$, the unary operators $\tau.$, $\bar{x}y.$, $x(y).$, $(\nu y)$ and $[x{=}y]$
for $x,y \in \mathcal{N}$, and the nullary operators (or constants) $0$ and $A(y_1,\dots,y_n)$ for
$A \in \mathcal{K}_n$ and $y_i \in \mathcal{N}$. CCS is not quite simple, since it features the infinite choice
operator.

Let $\mathcal{L}$ be a language. An $n$-ary $\mathcal{L}$-context $C$ is an $\mathcal{L}$-expression that may contain
special variables $X_1,\dots,X_n$, its holes. For $C$ an $n$-ary context, $C[P_1,\dots,P_n]$
is the result of substituting $P_i$ for $X_i$, for each $i = 1,\dots,n$.


Definition 6. Let $\mathcal{L}'$ and $\mathcal{L}$ be languages, generating sets of closed terms $\mathrm{T}_{\mathcal{L}'}$ and
$\mathrm{T}_{\mathcal{L}}$. Let $\mathcal{L}'$ be simple, with signature $\Sigma$. A translation from $\mathcal{L}'$ to $\mathcal{L}$ (or an
encoding from $\mathcal{L}'$ into $\mathcal{L}$) is a function $\mathcal{T} : \mathrm{T}_{\mathcal{L}'} \to \mathrm{T}_{\mathcal{L}}$. It is compositional if
for each $n$-ary operator $g \in \Sigma$ there exists an $n$-ary $\mathcal{L}$-context $C_g$ such that
$\mathcal{T}(g(P_1,\dots,P_n)) = C_g[\mathcal{T}(P_1),\dots,\mathcal{T}(P_n)]$.

Let $\sim$ be an equivalence relation on $\mathrm{T}_{\mathcal{L}'} \cup \mathrm{T}_{\mathcal{L}}$. A translation $\mathcal{T}$ from $\mathcal{L}'$ to
$\mathcal{L}$ is valid up to $\sim$ if it is compositional and $\mathcal{T}(P) \sim P$ for each $P \in \mathrm{T}_{\mathcal{L}'}$.

The above definition stems in essence from [10,11], but could be simplified here
since [10,11] also covered the case that $\mathcal{L}'$ is not simple. Moreover, here I restrict
attention to what are called closed term languages in [11].


8 The unencodability of π into CCS


In this section I show that there exists no translation of the π-calculus to CCS
that is valid up to $\stackrel{\bullet}{\sim}$. I even show this for the fragment of the (asynchronous)
π-calculus without choice, recursion, matching and restriction (thus only featuring
inaction, action prefixing and parallel composition).


Definition 7. Strong reduction bisimilarity, $\stackrel{\bullet}{\sim}_r$, is defined just as strong barbed
equivalence in Definition 4, but without the requirement on barbs.


I show that there is no translation of this fragment to CCS that is valid up to $\stackrel{\bullet}{\sim}_r$.
As $\stackrel{\bullet}{\sim}_r$ is coarser than $\stackrel{\bullet}{\sim}$, this implies my claim above. It may be useful to read this
section in parallel with the first half of Section 14.


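Definition 8. Let $\Leftarrow$ be the smallest preorder on CCS contexts such that
$\sum_{i \in I} E_i \Leftarrow E_j$ for all $j \in I$, $E|F \Leftarrow E$, $E|F \Leftarrow F$, $E\backslash L \Leftarrow E$, $E[f] \Leftarrow E$
and $A \Leftarrow P$ for all $A \in \mathcal{K}$ with $A \stackrel{\rm def}{=} P$. A variable $X$ occurs unguarded in a
context $E$ if $E \Leftarrow X$.


If the hole $X_1$ occurs unguarded in the unary context $E[\ ]$ and $U \xrightarrow{\tau}$ (resp.
$U \xrightarrow{\tau}\xrightarrow{\tau}$) then $E[U] \xrightarrow{\tau}$ (resp. $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$).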


Lemma 3. Let $E[\ ]$ be a unary and $C[\ ,\ ]$ a binary CCS context, and $P, Q,
P', Q', U \in \mathrm{T}_{\rm CCS}$. If $E[C[P,Q]] \xrightarrow{\tau}$ and $U \xrightarrow{\tau}$ but neither $E[C[P',Q]] \xrightarrow{\tau}$ nor
$E[C[P,Q']] \xrightarrow{\tau}$ nor $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$, then $C[P,Q] \xrightarrow{\tau}$.


Proof. Since the only rule in the operational semantics of
CCS with multiple premises has a conclusion labelled $\tau$, it
can occur at most once in the derivation of a CCS transition.
Thus, such a derivation is a tree with at most two branches,
as illustrated at the right. Now consider the derivation of
$E[C[P,Q]] \xrightarrow{\tau}$. If none of its branches prods into the subprocess
$P$, the transition would be independent of what is substituted here,
thus yielding $E[C[P',Q]] \xrightarrow{\tau}$. Thus, by symmetry, both $P$ and $Q$ are visited
by branches of this proof. It suffices to show that these branches come together
within the context $C$, as this implies $C[P,Q] \xrightarrow{\tau}$. So suppose, towards a contradiction,
that the two branches come together in $E$. Then $E$ must have the form
$E_1[E_2[\ ] \,|\, E_3[\ ]]$, where the hole $X_1$ occurs unguarded in $E_2$, $E_3$ as well as $E_1$.
But in that case $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$, contradicting the assumptions.


CCS proof trees 


Lemma 4. If $D[\ ,\ ,\ ]$ is a ternary CCS context, $P_1, P_2, P_3 \in \mathrm{T}_{\rm CCS}$, and
$D[P_1,P_2,P_3] \xrightarrow{\tau}$, then there exists an $i \in \{1,2,3\}$ and a CCS context $E[\ ]$ such
that $D'[P] \xrightarrow{\tau} E[P]$ for any $P \in \mathrm{T}_{\rm CCS}$. Here $D'$ is the unary context obtained
from $D[\ ,\ ,\ ]$ by substituting $P_j$ for the hole $X_j$, for all $j \in \{1,2,3\}$, $j \neq i$.

Proof. Since the derivation of $D[P_1,P_2,P_3] \xrightarrow{\tau}$ has at most two branches, one
of the $P_i$ is not involved in this proof at all. Thus, the derivation remains valid
if any other process $P$ is substituted in the place of that $P_i$; the target of the
transition remains the same, except for $P$ taking the place of $P_i$ in it.


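Theorem 1. There is no translation from this fragment to CCS that is valid up to $\stackrel{\bullet}{\sim}_r$.


Proof. Suppose, towards a contradiction, that $\mathcal{T}$ is a translation from the fragment to
CCS that is valid up to $\stackrel{\bullet}{\sim}_r$. By definition, this means that $\mathcal{T}$ is compositional
and that $\mathcal{T}(P) \stackrel{\bullet}{\sim}_r P$ for any process $P$ of the fragment.

As $\mathcal{T}$ is compositional, there exists a ternary CCS context $D[\ ,\ ,\ ]$ such
that, for any processes $R, S, T$ of the fragment,

$$\mathcal{T}(\bar{x}v \mid x(y).(R|S|T)) = D[\mathcal{T}(R), \mathcal{T}(S), \mathcal{T}(T)].$$

Since $\bar{x}v|x(y).(0|0|0) \xrightarrow{\tau}$ as well as $\mathcal{T}(\bar{x}v|x(y).(0|0|0)) \stackrel{\bullet}{\sim}_r \bar{x}v|x(y).(0|0|0)$,
it follows that $\mathcal{T}(\bar{x}v|x(y).(0|0|0)) \xrightarrow{\tau}$, i.e., $D[\mathcal{T}(0),\mathcal{T}(0),\mathcal{T}(0)] \xrightarrow{\tau}$. Hence
Lemma 4 can be applied. For simplicity I assume that $i = 1$; the other two
cases proceed in the same way. So there is a CCS context $E[\ ]$ such that
$D[P,\mathcal{T}(0),\mathcal{T}(0)] \xrightarrow{\tau} E[P]$ for all CCS terms $P$. In particular, for all terms $R$ of the fragment,

$$\mathcal{T}(\bar{x}v|x(y).(R|0|0)) = D[\mathcal{T}(R),\mathcal{T}(0),\mathcal{T}(0)] \xrightarrow{\tau} E[\mathcal{T}(R)]. \qquad (1)$$

I examine the translations of the π-calculus expressions $\bar{x}v|x(y).(R|0|0)$, for
$R \in \{\bar{y}z|v(w),\ 0|v(w),\ \bar{y}z|0,\ \tau\}$.

Since $\bar{x}v|x(y).(\bar{y}z|v(w)|0|0) \xrightarrow{\tau}\xrightarrow{\tau}$ and $\mathcal{T}$ respects $\stackrel{\bullet}{\sim}_r$,

$$\mathcal{T}(\bar{x}v|x(y).(\bar{y}z|v(w)|0|0)) \xrightarrow{\tau}\xrightarrow{\tau}.$$

In the same way, neither $\mathcal{T}(\bar{x}v|x(y).(0|v(w)|0|0)) \xrightarrow{\tau}\xrightarrow{\tau}$
nor $\mathcal{T}(\bar{x}v|x(y).(\bar{y}z|0|0|0)) \xrightarrow{\tau}\xrightarrow{\tau}$.   (2)

Furthermore, since $\mathcal{T}$ respects $\stackrel{\bullet}{\sim}_r$ and there is no $S \in \mathrm{T}_\pi$ such that
$\bar{x}v|x(y).(\bar{y}z|v(w)|0|0) \xrightarrow{\tau} S \stackrel{\tau}{\nrightarrow}$,
there is no $S \in \mathrm{T}_{\rm CCS}$ with $\mathcal{T}(\bar{x}v|x(y).(\bar{y}z|v(w)|0|0)) \xrightarrow{\tau} S \stackrel{\tau}{\nrightarrow}$.   (3)

By (1) and (3), $E[\mathcal{T}(\bar{y}z|v(w))] \xrightarrow{\tau}$.
By (1) and (2), $E[\mathcal{T}(0|v(w))] \stackrel{\tau}{\nrightarrow}$ and $E[\mathcal{T}(\bar{y}z|0)] \stackrel{\tau}{\nrightarrow}$.

Since $\mathcal{T}$ is compositional, there is a binary CCS context $C_|[\ ,\ ]$ such that
$\mathcal{T}(P|Q) = C_|[\mathcal{T}(P),\mathcal{T}(Q)]$ for any $P, Q \in \mathrm{T}_\pi$. It follows that

$$\begin{array}{l}
E[C_|[\mathcal{T}(\bar{y}z),\mathcal{T}(v(w))]] \xrightarrow{\tau}\\
E[C_|[\mathcal{T}(0),\mathcal{T}(v(w))]] \stackrel{\tau}{\nrightarrow}\\
E[C_|[\mathcal{T}(\bar{y}z),\mathcal{T}(0)]] \stackrel{\tau}{\nrightarrow}
\end{array}$$

Moreover, since $\tau \xrightarrow{\tau}$, also $U := \mathcal{T}(\tau) \xrightarrow{\tau}$. But it is not the case that
$\bar{x}v|x(y).(\tau|0|0) \xrightarrow{\tau}\xrightarrow{\tau}\xrightarrow{\tau}$, so neither $\mathcal{T}(\bar{x}v|x(y).(\tau|0|0)) \xrightarrow{\tau}\xrightarrow{\tau}\xrightarrow{\tau}$ holds, and
by (1) neither does $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$. So by Lemma 3, $\mathcal{T}(\bar{y}z|v(w)) = C_|[\mathcal{T}(\bar{y}z),\mathcal{T}(v(w))] \xrightarrow{\tau}$,
yet $\bar{y}z|v(w) \stackrel{\tau}{\nrightarrow}$. This contradicts the validity of $\mathcal{T}$ up to $\stackrel{\bullet}{\sim}_r$.


9 A valid translation of $\pi_{IM}$ into CCS$_\gamma$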


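Given a set $\mathcal{N}$ of names, I now define the parameters $\mathcal{K}$, $\mathcal{A}$ and $\gamma$ of the language
CCS$_\gamma$ that will be the target of my encoding. First of all, $\mathcal{K}$ will be the disjoint
union of all the sets $\mathcal{K}_n$, for $n \in \mathbb{N}$, of $n$-ary agent identifiers from the chosen
instance of the π-calculus.

Take $p \notin \mathcal{N}$. Let $\mathcal{R}_0 := \{{}^{\varsigma}p \mid \varsigma \in \{e,\ell,r\}^*\}$. The set $\mathcal{R}$ of private names is
$\{u^{\upsilon} \mid u \in \mathcal{R}_0 \wedge \upsilon \in \{'\}^*\}$. Let $\mathcal{S} = \{s_1, s_2, \dots\}$ be an infinite set of spare names,
disjoint from $\mathcal{N}$ and $\mathcal{R}$. Let $\mathcal{Z} := \mathcal{N} \uplus \mathcal{S}$ and $\mathcal{H} := \mathcal{Z} \uplus \mathcal{R}$.$^4$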

I take $Act$ to be the set of all expressions $\alpha$ from Table 2, as defined in
Section 4 (in terms of $\mathcal{Z}$ and $\mathcal{R}$), so $\mathcal{A} := Act \setminus \{\tau\}$. The communication function
$\gamma$ is given by $\gamma(M\bar{x}y, Nvy) = [x{=}v]MN\tau$, just as for rule e-s-com in Table 6.
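
A small Python sketch may help to see how this communication function interacts with relabelling. The tuple representation of actions, the function names, and the treatment of a matching sequence as an ordered tuple of name pairs are all assumptions made for this illustration; only the orientation "output first, input second" is covered.

```python
def prepend_match(x, y, M):
    """[x=y]M, where [x=x]M simply denotes M."""
    return M if x == y else ((x, y),) + M

def gamma(a, b):
    """gamma(M x-bar y, N v y) = [x=v]MN tau; None where this sketch leaves it undefined.

    Actions are tuples (M, kind, subject, object), kind in {"out", "in", "tau"}.
    """
    M, kind_a, x, y = a
    N, kind_b, v, y2 = b
    if kind_a == "out" and kind_b == "in" and y == y2:
        return (prepend_match(x, v, M + N), "tau", None, None)
    return None

def relabel(action, sigma):
    """Apply a substitution on names (a dict) to an action, dropping trivial matches."""
    M, kind, x, y = action
    r = lambda n: sigma.get(n, n) if n is not None else None
    new_M = ()
    for (a, b) in reversed(M):
        new_M = prepend_match(r(a), r(b), new_M)
    return (new_M, kind, r(x), r(y))
```

Synchronising the output `ȳu` with the input `vu` yields the conditional action `[y=v]τ`; relabelling `y` into `v` afterwards turns it into a plain `τ`, which is the effect exploited in Section 14.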

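For $\vec{x} = (x_1,\dots,x_n) \in \mathcal{N}^n$ and $\vec{y} = (y_1,\dots,y_n) \in \mathcal{H}^n$, with the $x_i$ distinct,
let $\{\vec{y}/\vec{x}\}^{\mathcal{S}} : \mathcal{S} \cup \{x_1,\dots,x_n\} \to \mathcal{H}$ be the substitution $\sigma$ with $\sigma(x_i) = y_i$ and
$\sigma(s_i) = x_i$ for $i = 1,\dots,n$, and $\sigma(s_i) = s_{i-n}$ for $i > n$. These functions extend
homomorphically to $\mathcal{A}$ and thereby constitute CCS$_\gamma$ relabellings. Abbreviate
$[\{\vec{y}/\vec{x}\}^{\mathcal{S}}]$ by $[\vec{y}/\vec{x}]$ and $[\{z/y\}^{\mathcal{S}}]$ by $[z/y]$.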
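
The substitution $\{\vec{y}/\vec{x}\}^{\mathcal{S}}$ can be written down directly. In the Python sketch below, spare names are represented by the strings `"s1"`, `"s2"`, ... and every other name outside the domain is left unchanged; both conventions, and the function names, are assumptions of this illustration.

```python
def is_spare(name):
    """Spare names are assumed to be written s1, s2, ..."""
    return name[0] == "s" and name[1:].isdigit()

def spare_subst(xs, ys):
    """The substitution {ys/xs}^S, returned as a function on names."""
    n = len(xs)
    assert len(set(xs)) == n == len(ys), "the x_i must be distinct"

    def sigma(name):
        if name in xs:                      # sigma(x_i) = y_i
            return ys[xs.index(name)]
        if is_spare(name):
            i = int(name[1:])
            # sigma(s_i) = x_i for i <= n, and s_{i-n} for i > n
            return xs[i - 1] if i <= n else f"s{i - n}"
        return name                         # names outside the domain are unchanged
    return sigma
```

With `sigma = spare_subst(["y"], ["v"])`, the name `y` is mapped to `v` while the spare name `s1` is mapped to `y`, so `y` remains in the range; this is the property used in Example 8.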

For $\eta \in \{\ell, r, e\}$ and $y \in \mathcal{Z}$, let the surjective substitutions $\eta : \mathcal{R} \to \mathcal{R}$ and
$p_y : \{y\} \cup \mathcal{R} \to \{y\} \cup \mathcal{R}$ be given by:


Py(y) = Pp 
np) = "p pp) =y 
np) := p? ifs nt pylu) := e(u) ifu #y,p. 


These $\sigma : \mathcal{H} \to \mathcal{H}$ are injective, i.e., $x[\sigma] \neq y[\sigma]$ when $x \neq y$. Also they yield CCS$_\gamma$
relabellings. The following compositional encoding, which will be illustrated with
examples in Section 12, defines my translation from $\pi_{IM}$ to CCS$_\gamma$.


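$$\begin{array}{rcl}
\mathcal{T}(0) &:=& 0\\
\mathcal{T}(M\tau.P) &:=& M\tau.\mathcal{T}(P)\\
\mathcal{T}(M\bar{x}y.P) &:=& M\bar{x}y.\mathcal{T}(P)\\
\mathcal{T}(Mx(y).P) &:=& \sum_{z \in \mathcal{H}} Mxz.(\mathcal{T}(P)[z/y])\\
\mathcal{T}((\nu y)P) &:=& \mathcal{T}(P)[p_y]\\
\mathcal{T}(P|Q) &:=& \mathcal{T}(P)[\ell] \,\|\, \mathcal{T}(Q)[r]\\
\mathcal{T}(P+Q) &:=& \mathcal{T}(P) + \mathcal{T}(Q)\\
\mathcal{T}(A(\vec{y})) &:=& A[\vec{y}/\vec{x}] \quad \mbox{when } A(\vec{x}) \stackrel{\rm def}{=} P
\end{array}$$

where the CCS$_\gamma$ agent identifier $A$ has the defining equation $A = \mathcal{T}(P)$ when
$A(\vec{x}) \stackrel{\rm def}{=} P$ was the defining equation of the agent identifier $A$ from the π-calculus.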
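
The compositional shape of this encoding can be made concrete with a small Python sketch that prints the translation of a term as a string. It is an illustration only: terms are nested tuples, matching sequences are assumed empty, agent identifiers are left out, and the ASCII notations `x!y` for output, `x?z` for input and `l` for $\ell$ are assumptions of this sketch rather than the paper's notation.

```python
def T(p):
    """Translate a pi-term (nested tuples) into a string denoting a CCS_gamma term."""
    kind = p[0]
    if kind == "nil":
        return "0"
    if kind == "tau":                       # ("tau", P)
        return f"τ.{T(p[1])}"
    if kind == "out":                       # ("out", x, y, P)
        _, x, y, cont = p
        return f"{x}!{y}.{T(cont)}"
    if kind == "in":                        # ("in", x, y, P): infinite sum over z in H
        _, x, y, cont = p
        return f"Σ_z {x}?z.(({T(cont)})[z/{y}])"
    if kind == "res":                       # ("res", y, P): drop nu, relabel with p_y
        _, y, body = p
        return f"({T(body)})[p_{y}]"
    if kind == "par":                       # ("par", P, Q): tag left with l, right with r
        _, left, right = p
        return f"({T(left)})[l] || ({T(right)})[r]"
    if kind == "sum":                       # ("sum", P, Q)
        _, left, right = p
        return f"{T(left)} + {T(right)}"
    raise ValueError(kind)
```

Each clause builds its result only from the translations of the immediate subterms, which is what compositionality in the sense of Definition 6 requires.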

To explain what this encoding does, inaction, silent prefix, output prefix and
choice are translated homomorphically. The input prefix is translated into an infinite
sum over all possible input values $z$ that could be received, of the received
message $Mxz$ followed by the continuation process $\mathcal{T}(P)[z/y]$. Here $[z/y]$ is a
CCS relabelling operator that simulates substitution of $z$ for $y$ in $\mathcal{T}(P)$. This
implements the rule early-input from Table 6. Agent identifiers are also translated
homomorphically, except that their arguments $\vec{y}$ are replaced by relabelling
operators.

4 The names in $\mathcal{S}$ and in $\mathcal{R} \setminus \mathcal{R}_0$ exist solely to make the substitutions $\{\vec{y}/\vec{x}\}^{\mathcal{S}}$, $\eta$ and
$p_y$ surjective. Here $\sigma$ is surjective iff $\mathrm{dom}(\sigma) \subseteq \mathrm{range}(\sigma)$.

Restriction is translated by simply dropping the restriction operator, but
renaming the restricted name $y$ into a private name $p$ that generates no barbs.
The operator $[p_y]$ injectively renames all private names ${}^{\varsigma}p$ that occur in the scope
of $(\nu y)$ by tagging all of them with a tag $e$. This ensures that the new private
name $p$ is fresh, so that no name clashes can occur that in $\pi_{IM}$ would have been
prevented by the restriction operator.

Parallel composition is almost translated homomorphically. However, each
private name on the right is tagged with an $r$, and on the left with an $\ell$. This
guarantees that private names introduced at different sides of a parallel composition
cannot interact. Interaction is only possible when the name is passed on
in the appropriate way.
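
The tagging discipline can be sketched in a few lines of Python. The sketch is deliberately simplified: names are pairs `("pub", x)` or `("priv", tags)`, the primed private names that make the real substitutions surjective are ignored, and the tag $\ell$ is written `"l"`. All of these representation choices are assumptions of the illustration.

```python
def tag(eta, name):
    """Prefix the tag eta to a private name; public names are left unchanged."""
    kind, value = name
    return ("priv", eta + value) if kind == "priv" else name

def restrict(y):
    """Simplified p_y: send y to the untagged private name p, tag other private names with e."""
    def p_y(name):
        if name == ("pub", y):
            return ("priv", "")
        return tag("e", name)
    return p_y

def apply_all(name, relabellings):
    """Apply relabellings innermost first, as they would be stacked around a process."""
    for relabel in relabellings:
        name = relabel(name)
    return name
```

Restricting `x` on the left and on the right of a parallel composition yields the distinct private names `("priv", "l")` and `("priv", "r")`, so the two sides cannot communicate over them, as in Example 5 below. Applying `restrict("y")` twice and then tagging with `"l"` sends `y` to `("priv", "le")`, matching the name that appears in Example 7.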

The main result of this paper states the validity of the above translation,
and thus that CCS$_\gamma$ is at least as expressive as $\pi_{IM}$:


Theorem 2. For $P \in \mathrm{T}_\pi$ one has $\mathcal{T}(P) \stackrel{\bullet}{\sim} P$.


See http://theory.stanford.edu/~rvg/abstracts.html#153 for a proof.

Theorem 2 says that each π-calculus process is strongly barbed bisimilar to its
translation as a CCS$_\gamma$ process. The labelled transition systems of the π-calculus
and CCS$_\gamma$ are both of the type presented in Section 4, i.e. with transition labels
taken from Table 2. There also the associated barbs are defined. By Theorem 2
each π transition $P \xrightarrow{\tau} P'$ can be matched by a CCS$_\gamma$ transition $\mathcal{T}(P) \xrightarrow{\tau} Q$
with $\mathcal{T}(P') \stackrel{\bullet}{\sim} Q$. Likewise, each CCS$_\gamma$ transition $\mathcal{T}(P) \xrightarrow{\tau} Q$ can be matched
by a π transition $P \xrightarrow{\tau} P'$ with $\mathcal{T}(P') \stackrel{\bullet}{\sim} Q$. Moreover, if $P$ has a barb $x$ (or
$\bar{x}$) then so does $\mathcal{T}(P)$, and vice versa. Here a π or CCS$_\gamma$ process $P$ has a barb
$a \in \mathcal{Z} \cup \overline{\mathcal{Z}}$ iff $P \xrightarrow{ay} P'$ or $P \xrightarrow{a(y)} P'$ for some name $y \in \mathcal{H}$ and process $P'$.
Transitions $P \xrightarrow{Mxy} P'$, $P \xrightarrow{Mx(y)} P'$, $P \xrightarrow{M\bar{x}y} P'$ or $P \xrightarrow{M\bar{x}(y)} P'$ with $M \neq \varepsilon$ or
$x \in \mathcal{R}$ generate no barbs.


10 The ideas behind this encoding 


The above encoding combines seven ideas, each of which appears to be necessary
to achieve the desired result. Accordingly, the translation could be described as
the composition of seven encodings, leading from $\pi_{IM}$ to CCS$_\gamma$ via six intermediate
languages. Here a language comprises syntax as well as semantics. Each of
the intermediate languages has a labelled transition system semantics where the
labels are as described in Section 4. Accordingly, at each step it is well-defined
whether strong barbed bisimilarity is preserved, and one can show it is. These
proofs go by induction on the derivation of transitions, where the transitions
with visible labels are necessary steps even when one would only be interested in
the transitions with $\tau$-labels. There are various orders in which the seven steps
can be taken.

Fig. 2. Translation from the π-calculus with implicit matching to CCS$_\gamma$.
Definitions of the intermediate languages shown in the figure, such as $\pi_{IM}(\mathcal{Z},\mathcal{R})$, are not provided here.

The seven steps are:

1. Moving from the late operational semantics (Table 3) to the early one (Table 4).
   This translation is syntactically the identity function, but still its
   validity requires proof, as the generated LTS changes. The proof amounts to
   showing that the same barbed transition system is obtained before and after
   the translation; see Section 6.

2. Moving from a regular operational semantics (Table 4) to a symbolic one
   (Table 6). This step commutes with the previous one.

3. Renaming the bound names of a process in such a way that the result is clash-free
   [3], meaning that all bound names are different and no name occurs both
   free and bound. The trick is to do this in a compositional way. The relabelling
   operators $[\ell]$, $[r]$ and $[p_y]$ in the final encoding stem from this step.

4. Eliminating the need for rule alpha in the operational semantics. This works
   only for clash-free processes, as generated by the previous step.

5. Dropping the restriction operators, while preserving strong barbed bisimilarity.
   This eliminates the orange parts of Table 6. For this purpose clash-freedom
   and the elimination of alpha are necessary.

6. Changing all occurrences of substitutions into applications of CCS relabelling
   operators.

7. The previous six steps generate a language with a semantics in the De Simone
   format. So from here on a translation to MEIJE or aprACP$_R$ is known to be
   possible. The last step, to CCS$_\gamma$, involves changing the remaining form of
   name-binding into an infinite sum.


As indicated in Figure 2, my translation maps the π-calculus with implicit
matching to a subset of CCS$_\gamma$. On that subset, π-calculus behaviour can be
replayed faithfully, at least up to strong early congruence, the congruence closure
of strong barbed bisimilarity (cf. [11]). However, the interaction between
a translated π-calculus process and a CCS$_\gamma$ process outside the image of the
translation may be disturbing, and devoid of good properties. Also, in case intermediate
languages are encountered on the way from $\pi_{IM}$ to CCS$_\gamma$, which is
just one of the ways to prove my result, no guarantees are given on the sanity of
those languages outside the image of the source language, i.e. on their behaviour
outside the realm of clash-free processes after Step 3 has been made.


11 Triggering 


To include the general matching operator in the source language I need to extend
the target language with the triggering operator $s{\Rightarrow}P$ of MEIJE [1,34]:

$$\frac{P \xrightarrow{\alpha} P'}{s{\Rightarrow}P \xrightarrow{s\alpha} P'}$$

MEIJE features signals and actions; each signal $s$ can be "applied" to an action
$\alpha$, and doing so yields an action $s\alpha$. In this paper the actions are as in Table 2,
and a signal is an expression $[x{=}y]$ with $x,y \in \mathcal{N}$; application of a signal to an
action was defined in Section 6.

Triggering cannot be expressed in CCS$_\gamma$, as rooted weak bisimilarity [2], the
weak congruence of [19,20], is a congruence for CCS$_\gamma$ but not for triggering.
However, rooted branching bisimilarity [12] is a congruence for triggering [9].

My translation from $\pi_{IM}$ to CCS$_\gamma$ can be extended into one from the full
π-calculus to CCS$_\gamma^{\rm trig}$ by adding the clause

$$\mathcal{T}([x{=}y]P) := [x{=}y]{\Rightarrow}\mathcal{T}(P).$$

Theorem 2 applies to this extended translation as well.


12 Examples 


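Example 2. The outgoing transitions of $x(y).\bar{y}w$ are

[Transition diagram: one branch for each name $z_1, z_2, \dots, z_n, \dots$; the branch for $z_i$ starts with the input of $z_i$ on $x$, followed by the output of $w$ on $z_i$.]

Here the $z_i$ range over all names in $\mathcal{N}$. Below I flatten such a picture by drawing
the arrows only for one name $z$, which however still ranges over $\mathcal{N}$.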


Example 3. The transitions of $P = x(y).\bar{y}w \mid \bar{x}u.u(v)$ are

[Transition diagram of $P$.]

Here $\bar{u}w|u(v)$ is the special case of $\bar{z}w|u(v)$ obtained by taking $z := u$. It thus
also has outgoing transitions labelled $\bar{u}w$ and $uq$, for $q \in \mathcal{N}$.

Up to strong bisimilarity, the same transition system is obtained by the
translation $\mathcal{T}(P)$ of $P$ in CCS$_\gamma$:

$$\mathcal{T}(P) = \Big(\sum_{z \in \mathcal{H}} xz.((\bar{y}w.0)[z/y])\Big)[\ell] \;\Big\|\; \Big(\bar{x}u.\sum_{z \in \mathcal{H}} uz.(0[z/v])\Big)[r].$$

Since there are no restriction operators in this example, the relabelling operators
$[\ell]$ and $[r]$ are of no consequence. Here

$$\mathcal{T}(P) \xrightarrow{\tau} (\bar{y}w.0)[u/y][\ell] \;\Big\|\; \Big(\sum_{z \in \mathcal{H}} uz.(0[z/v])\Big)[r] \xrightarrow{\tau} 0[u/y][\ell] \;\|\; 0[w/v][r].$$


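Example 4. Let $Q = (\nu x)\big(x(y).\bar{y}w \mid (\nu u)(\bar{x}u.u(v))\big)$. It has no other transitions
than
$$Q \xrightarrow{\tau} (\nu x)(\nu u)(\bar{u}w|u(v)) \xrightarrow{\tau} (\nu x)(\nu u)(0|0).$$

Its translation $\mathcal{T}(Q)$ into CCS$_\gamma$ is

$$\bigg(\Big(\sum_{z \in \mathcal{H}} xz.((\bar{y}w.0)[z/y])\Big)[\ell] \;\Big\|\; \Big(\bar{x}u.\sum_{z \in \mathcal{H}} uz.(0[z/v])\Big)[p_u][r]\bigg)[p_x].$$

Up to strong bisimilarity, its transition system is the same as that of $P$ or $\mathcal{T}(P)$
from Example 3, except that in transition labels the name $u$ is renamed into the
private name ${}^{er}p$, and $x$ is renamed into the private name $p$. One has $\mathcal{T}(Q) \stackrel{\bullet}{\sim} Q$,
since private names generate no barbs.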


Example 5. The process $(\nu x)(x(y)) \mid (\nu x)(\bar{x}u)$ has no outgoing transitions.
Accordingly, its translation

$$\Big(\sum_{z \in \mathcal{H}} xz.(0[z/y])\Big)[p_x][\ell] \;\Big\|\; (\bar{x}u.0)[p_x][r]$$

only has outgoing transitions labelled ${}^{\ell}p\,z$ for $z \in \mathcal{H}$ and $\overline{{}^{r}p}\,u$. Since the names ${}^{\ell}p$ and
${}^{r}p$ are private, these transitions generate no barbs. In this example, the relabelling
operators $[\ell]$ and $[r]$ are essential. Without them, the mentioned transitions
would have complementary names, and communicate into a $\tau$-transition.


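Example 6. Let $P = (\nu y)(\bar{x}y.\bar{y}w) \mid x(u).u(v)$. Then
$$P \xrightarrow{\tau} (\nu y)(\bar{y}w \mid y(v)) \xrightarrow{\tau} (\nu y)(0|0).$$

Now $\mathcal{T}((\nu y)(\bar{x}y.\bar{y}w)) = (\bar{x}y.\bar{y}w.0)[p_y]$ and

$$\mathcal{T}(x(u).u(v)) = \sum_{z \in \mathcal{H}} xz.\Big(\Big(\sum_{z \in \mathcal{H}} uz.(0[z/v])\Big)[z/u]\Big).$$

Hence $\mathcal{T}((\nu y)(\bar{x}y.\bar{y}w))[\ell] \xrightarrow{\bar{x}\,{}^{\ell}p} (\bar{y}w.0)[p_y][\ell]$. Since the substitution $r$ used in
the relabelling operator $[r]$ is surjective, there is a name $s$ that is mapped to ${}^{\ell}p$,
namely ${}^{\ell}p'$. Considering that $\mathcal{T}(x(u).u(v)) \xrightarrow{xs} \mathcal{T}(u(v))[s/u]$,

$$\mathcal{T}(P) \xrightarrow{\tau} (\bar{y}w.0)[p_y][\ell] \;\Big\|\; \Big(\sum_{z \in \mathcal{H}} uz.(0[z/v])\Big)[s/u][r].$$

These parallel components can perform actions $\overline{{}^{\ell}p}\,w$ and ${}^{\ell}p\,w$, synchronising into
a $\tau$-transition, and thereby mimicking the behaviour of $P$.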


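Example 7. Let $P = (\nu y)(\bar{x}y.(\nu y)(\bar{y}w)) \mid x(u).u(v)$.
Then $P \xrightarrow{\tau} (\nu y)\big((\nu y)(\bar{y}w) \mid y(v)\big) \stackrel{\tau}{\nrightarrow}$. One obtains

$$\mathcal{T}(P) \xrightarrow{\tau} (\bar{y}w.0)[p_y][p_y][\ell] \;\Big\|\; \Big(\sum_{z \in \mathcal{H}} uz.(0[z/v])\Big)[s/u][r]$$

for a name $s$ that under $[r]$ maps to ${}^{\ell}p$. Now the left component can do an action
$\overline{{}^{\ell e}p}\,w$, whereas the right component can merely match with $\overline{{}^{\ell}p}\,w$. No synchronisation
is possible. This shows why it is necessary that the relabelling $[p_y]$ not only
renames $y$ into $p$, but also $p$ into ${}^{e}p$.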


Example 8. Let $P = x(y).x(w).\bar{w}u$. Then
$$P \mid \bar{x}v.\bar{x}y.y(v) \xrightarrow{\tau} x(w).\bar{w}u \mid \bar{x}y.y(v) \xrightarrow{\tau} \bar{y}u \mid y(v) \xrightarrow{\tau} 0|0.$$

Therefore, $\mathcal{T}(P \mid \bar{x}v.\bar{x}y.y(v))$ must also be able to start with three consecutive
$\tau$-transitions. Note that

$$\mathcal{T}(P \mid \bar{x}v.\bar{x}y.y(v)) = \mathcal{T}(P)[\ell] \;\Big\|\; \Big(\bar{x}v.\bar{x}y.\sum_{z \in \mathcal{H}} yz.(0[z/v])\Big)[r]$$

with

$$\mathcal{T}(P) = \sum_{z \in \mathcal{H}} xz.\Big(\Big(\sum_{z \in \mathcal{H}} xz.((\bar{w}u.0)[z/w])\Big)[z/y]\Big).$$

The only way to obtain $\mathcal{T}(P \mid \bar{x}v.\bar{x}y.y(v)) \xrightarrow{\tau}\xrightarrow{\tau}\xrightarrow{\tau}$ is when $\mathcal{T}(P) \xrightarrow{xv} Q \xrightarrow{xy}\xrightarrow{\bar{y}u}$.
The CCS$_\gamma$ process $Q$ must be

$$\Big(\sum_{z \in \mathcal{H}} xz.((\bar{w}u.0)[z/w])\Big)[v/y].$$

Given the semantics of CCS relabelling, one must have $\sum_{z \in \mathcal{H}} xz.((\bar{w}u.0)[z/w]) \xrightarrow{\alpha}$,
such that applying the relabelling $[v/y]$ to $\alpha$ yields $xy$. When simply taking $[\{v/y\}]$
for $[v/y]$, that is, the relabelling that changes all occurrences of the name $y$ in a
transition label into $v$, this is not possible. This shows that a simplification of
my translation without use of the spare names $\mathcal{S}$ would not be valid.

Crucial for this example is that I only use surjective substitutions. $[v/y]$ is an
abbreviation of $[\{v/y\}^{\mathcal{S}}]$. Here $\{v/y\}^{\mathcal{S}}$ is a surjective substitution that not only
renames $y$ into $v$, but also sends a spare name $s$ to $y$. This allows me to take
$\alpha := xs$. Consequently, in deriving the transition $\sum_{z \in \mathcal{H}} xz.((\bar{w}u.0)[z/w]) \xrightarrow{\alpha}$, I
choose $z$ to be $s$, so that

$$\sum_{z \in \mathcal{H}} xz.((\bar{w}u.0)[z/w]) \xrightarrow{xs} (\bar{w}u.0)[s/w] \xrightarrow{\bar{s}u} 0[s/w].$$

Putting this in the scope of the relabelling $[v/y]$ yields
$$Q \xrightarrow{xy} (\bar{w}u.0)[s/w][v/y] \xrightarrow{\bar{y}u} 0[s/w][v/y]$$
as desired, and the example works out.$^5$


This example shows that spare names play a crucial role in intermediate states of
CCS$_\gamma$-translations. In general this leads to stacked relabellings from true names
into spare ones and back. Making sure that in the end one always ends up with
the right names calls for particularly careful proofs that do not cut corners in
the bookkeeping of names.

A last example showing a crucial feature of my translation is discussed in 
Section 14. 


13 The unencodability of CCS into π


Let $f : \mathcal{A} \to \mathcal{A}$ be a CCS relabelling function satisfying $f(x_i y) = x_{i+1} y$. Here
$(x_i)_{i=0}^{\infty}$ is an infinite sequence of names, and $\mathcal{A}$ is as in Section 4. The CCS
process $A$ defined by

$$A := x_0 y.0 + \tau.(A[f])$$

satisfies $\exists P.\ A \Longrightarrow P \wedge P{\downarrow_{x_i}}$ for all $i \geq 0$, i.e., it has infinitely many weak
barbs. It is easy to check that all weak barbs of a π-calculus process $Q$ must be
free names of $Q$, of which there are only finitely many. Consequently, there is
no π-calculus process $Q$ with $A \stackrel{\bullet}{\sim} Q$, and hence no translation of CCS in the
π-calculus that is valid up to $\stackrel{\bullet}{\sim}$.$^6$
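
The growth of the set of weak barbs can be illustrated with a few lines of Python. The sketch tracks only the subject name of the initial action of $A$ after $k$ unfoldings, using strings `"x0"`, `"x1"`, ... for the names $x_i$; this representation and the function names are assumptions of the illustration.

```python
def f(name):
    """The relabelling on subjects: x_i is sent to x_{i+1}."""
    return f"x{int(name[1:]) + 1}"

def barb_after(k):
    """Barb of the process reached from A by k tau-steps, i.e. A under k nested [f]."""
    barb = "x0"              # the summand x_0 y.0 contributes the barb x_0
    for _ in range(k):
        barb = f(barb)       # each enclosing relabelling [f] shifts the index by one
    return barb

def weak_barbs(max_steps):
    """Weak barbs of A witnessed within max_steps tau-steps."""
    return {barb_after(k) for k in range(max_steps + 1)}
```

Every additional τ-step contributes a new barb, so no finite set of free names can cover them all.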


14 Related work 


My translation from $\pi_{IM}$ to CCS$_\gamma$ is inspired by an earlier translation $\mathcal{E}$ from a
version of the π-calculus to CCS, proposed by Banach & van Breugel [3]. The
paper [3] takes $\mathcal{A} := \{(x,y) \mid x,y \in \mathcal{N}\}$ for the visible CCS actions; action
$(x,y)$ corresponds with my $xy$, and its complement $\overline{(x,y)}$ with my $\bar{x}y$. On the
fragment of π featuring inaction, prefixing, choice and parallel composition, the
encoding of [3] is given by

$$\begin{array}{rcl}
\mathcal{E}(0) &:=& 0\\
\mathcal{E}(\tau.P) &:=& \tau.\mathcal{E}(P)\\
\mathcal{E}(\bar{x}y.P) &:=& \overline{(x,y)}.\mathcal{E}(P)\\
\mathcal{E}(x(y).P) &:=& \sum_{z \in \mathcal{N}} (x,z).(\mathcal{E}(P)[z/y])\\
\mathcal{E}(P|Q) &:=& \mathcal{E}(P) \,|\, \mathcal{E}(Q)\\
\mathcal{E}(P+Q) &:=& \mathcal{E}(P) + \mathcal{E}(Q).
\end{array}$$

5 This use of spare names solves the problem raised in [3, Footnote 5].
6 In [28] it was already mentioned, by reference to Pugliese [personal communication,
1997], that CCS relabelling operators cannot be encoded in the π-calculus.


The main result of [3] (Theorem 5.3), stating the correctness of this encoding,
says that $P \stackrel{\bullet}{\sim}_r Q$ iff $\mathcal{E}(P) \stackrel{\bullet}{\sim}_r \mathcal{E}(Q)$, for all π-processes $P$ and $Q$. Here $\stackrel{\bullet}{\sim}_r$
is strong reduction bisimilarity; see Definition 7. In fact, replacing the call to
Lemma 3.5 in the proof of this theorem by a call to Lemma 3.4, they could
equally well have claimed the stronger result that $P \stackrel{\bullet}{\sim}_r \mathcal{E}(P)$ for all π-processes
$P$, i.e., that $\mathcal{E}$ is valid up to $\stackrel{\bullet}{\sim}_r$.

This result contradicts my Theorem 1 and thus must be flawed. Where it fails
can be detected by pushing the counterexample process $P := \bar{x}v \mid x(y).R$ with
$R := \bar{y}u \mid v(w)$, used in the proof of Theorem 1, through the encoding of [3]. I
claim that while $P \xrightarrow{\tau} \bar{v}u|v(w) \xrightarrow{\tau}$, its translation $\mathcal{E}(P)$ cannot do two $\tau$-steps.
Hence $P \not\stackrel{\bullet}{\sim}_r \mathcal{E}(P)$. Using a trivial process $Q$ such that $P \stackrel{\bullet}{\sim}_r Q \stackrel{\bullet}{\sim}_r \mathcal{E}(Q)$, this
also constitutes a counterexample to [3, Theorem 5.3].

Note that $\mathcal{E}(R) = \overline{(y,u)}.0 \mid \sum_{z \in \mathcal{N}} (v,z).(0[z/w])$. This process can perform
the actions $\overline{(y,u)}$ as well as $(v,u)$, but no action $\tau$, since $y \neq v$. Now

$$\mathcal{E}(P) = \overline{(x,v)}.0 \;\Big|\; \sum_{z \in \mathcal{N}} (x,z).(\mathcal{E}(R)[z/y]).$$

Its only $\tau$-transition goes to $0 \mid \mathcal{E}(R)[v/y]$. This process can perform the actions
$\overline{(v,u)}$ as well as $(v,u)$, but still no action $\tau$, since $[v/y]$ is a CCS relabelling operator
rather than a substitution, and it is applied only after any synchronisations
between $\overline{(y,u)}.0$ and $\sum_{z \in \mathcal{N}} (v,z).(0[z/w])$ are derived.

My own encoding $\mathcal{T}$ translates the processes $P$ and $R$ essentially in the same
way, but now there is a transition $\mathcal{T}(R) \xrightarrow{[y=v]\tau} (0 \| 0[u/w])$. The renaming $[v/y]$
turns this synchronisation into a $\tau$:

$$\mathcal{T}(P) \xrightarrow{\tau} \mathcal{T}(R)[v/y] \xrightarrow{\tau} (0 \| 0[u/w])[v/y].$$

The crucial innovation of my approach over [3] in this regard is the switch from
the early to the early symbolic semantics of the π-calculus, combined with a
switch from CCS as target language to CCS$_\gamma$.

In [31], Roscoe argues that CSP is at least as expressive as the π-calculus. As
evidence he presents a translation from the latter to the former. Roscoe does not
provide a criterion for the validity of such a translation, nor a result implying
that a suitable criterion has been met. The following observations show that his
translation is not compositional, and that it is debatable whether it preserves a
reasonable semantic equivalence.


(1) Roscoe translates τ.P as tau → csp[P], where → is CSP action prefixing and
csp[P] is the translation of the π-expression P. Here tau is a visible CSP
action, that is renamed into τ only later in the translation, when combining
prefixes into summations. Thus, on the level of prefixes, the translation
does not preserve (strong) barbed bisimilarity or any other suitable semantic
equivalence. This problem disappears when we stop seeing prefixing and
choice as separate operators in the π-calculus, instead using a guarded choice
$\sum_{i\in I} \alpha_i.P_i$.

(2) Roscoe translates x(y).P into x?z → csp[P{z/y}]. This is not compositional,
since the translation of x(y).P does not merely call the translation of P as a
building block, but the result of applying a substitution to P. Substitution
is not a CSP operator; it is applied to the π-expression P before translating
it. While this mode of translation has some elegance, it is not compositional,
and it remains questionable whether a suitable weaker correctness criterion
can be formulated that takes the place of compositionality here.

(3) To deal with restriction, [31] works with translations csp[P]$_{\kappa,\sigma}$, where two
parameters κ and σ are passed along that keep track of sets of fresh names to
translate restricted names into. The set of fresh names σ is partitioned in the
translation of P|Q (page 388), such that both sides get disjoint sets of fresh
names to work with. Although the idea is rather similar to the one used here,
the passing of the parameters makes the translation non-compositional. In
a compositional translation csp[P|Q] the arguments P and Q may appear
in the translated CSP process only in the shape csp[P] and csp[Q], not
csp[P]$_{\kappa,\sigma'}$ for new values of σ′.


As pointed out in [14,29], even the most bizarre translations can be found valid 
if one only imposes requirements based on semantic equivalence, and not com- 
positionality. Roscoe’s translation is actually rather elegant. However, we do not 
have a decent criterion to say to what extent it is a valid translation. The ex- 
pressiveness community strongly values compositionality as a criterion, and this 
attribute is the novelty brought in by my translation. 


15 Conclusion 


This paper exhibited a compositional translation from the π-calculus to CCS$_\gamma$
extended with triggering that is valid up to strong barbed bisimilarity, thereby
showing that the latter language is at least as expressive as the former. Triggering
is not needed when restricting to the π-calculus with implicit matching (as used
for instance in [33]). Conversely, I observed that CCS (and thus certainly CCS$_\gamma^{\rm trig}$)
cannot be encoded in the π-calculus. I also showed that the upgrade of CCS to
CCS$_\gamma$ is necessary to capture the expressiveness of the π-calculus.

A consequence of this work is that any system specification or verification
that is carried out in the setting of the π-calculus can be replayed in CCS$_\gamma^{\rm trig}$. The
main idea here is to replace the names that are kept private in the π-calculus by
means of the restriction operator, by names that are kept private by means of a
careful bookkeeping ensuring that the same private name is never used twice. Of
course this in no way suggests that it would be preferable to replay π-calculus
specifications or verifications in CCS$_\gamma$.

My translation encodes the restriction operator (νy) from the π-calculus by
renaming y into a “private name”. Crucial for this approach is that private
names generate no barbs, in contrast with standard approaches where all names
generate barbs. This use of private names is part of the definition of strong
barbed bisimilarity $\stackrel{\bullet}{\sim}$ on my chosen instance of CCS$_\gamma^{\rm trig}$, and justified since that
definition is custom made in the present paper. The use of private names can be
avoided by placing an outermost CCS restriction operator around any translated
π-process. This, however, would violate the compositionality of my translation.

The use of infinite summation in my encoding might be considered a serious
drawback. However, when sticking to a countable set of π-calculus names, only
countable summation is needed, which, as shown in [8], can be eliminated in
favour of unguarded recursion with infinitely many recursion equations. As the
original presentation of the π-calculus already allows unguarded recursion with
infinitely many recursion equations [24], the latter cannot reasonably be forbidden
in the target language of the translation. Still, it is an interesting question
whether infinite sums or infinite sets of recursion equations can be avoided in the
target language if we rule them out in the source language. My conjecture is that
this is possible, but at the expense of further upgrading CCS$_\gamma^{\rm trig}$, say to aprACP$_R$.
This would however require work that goes well beyond what is presented here.

An alternative approach is to use a version of CCS featuring a choice quan- 
tifier [17] instead of infinitary summation, a construct that looks remarkably 
like an infinite sum, but is as finite as any quantifier from predicate logic. A 
choice quantifier binds a data variable z (here ranging over names) to a single 
process expression featuring z. The present application would need a function 
from names to CCS relabelling operators. When using this approach, the size of 
translated expressions becomes linear in the size of the originals. 

It could be argued that choice quantification is a step towards mobility. On
the other hand, if mobility is associated more with scope extrusion than with
name binding itself, one could classify CCS$_\gamma$ with choice quantification as an
immobile process algebra. A form of choice quantification is standard in mCRL2
[15], which is often regarded as “immobile”.

My translation from π to CCS$_\gamma$ has a lot in common with the attempted
translation of π to CCS in [3]. That one is based on the early operational
semantics of CCS, rather than the early symbolic one used here. As a consequence,
substitutions there cannot be eliminated in favour of relabelling operators.

A crucial step in my translation yields an intermediate language with an
operational semantics in De Simone format. In [7] another representation of the
π-calculus is given through an operational semantics in the De Simone format. It
uses a different way of dealing with substitutions. This type of semantics could
be an alternative stepping stone in an encoding from the π-calculus into CCS$_\gamma$.

In [28] Palamidessi showed that there exists no uniform encoding of the
π-calculus into a variant of CCS. Here uniform means that $\mathcal{T}(P|Q) = \mathcal{T}(P)|\mathcal{T}(Q)$.
This does not contradict my result in any way, as my encoding is not uniform.
Palamidessi [28] finds uniformity a reasonable criterion for encodings, because 
it guarantees that the translation maintains the degree of distribution of the 
system. In [30], however, it is argued that it is possible to maintain the degree of 
distribution of a system upon translation without requiring uniformity. In fact, 
the translation offered here is a good example of one that is not uniform, yet 
maintains the degree of distribution. 

Gorla [13] proposes five criteria for valid encodings, and shows that there
exists no valid encoding of the π-calculus (even its asynchronous fragment) into
CCS. Gorla’s proof heavily relies on the criterion of name invariance imposed
on valid encodings. It requires for $P \in \mathrm{T}_\pi$ and an injective substitution σ that
$\mathcal{T}(P\sigma) = \mathcal{T}(P)\sigma'$ for some substitution σ′ that is obtained from σ through
a renaming policy. Furthermore, the renaming policy is such that if dom(σ) is
finite, then also dom(σ′) is finite. This latter requirement is not met by the
encoding presented here, for a single name $x \in \mathcal{N}$ corresponds with an infinite
set of actions xy, the “names” of CCS, and a substitution that merely renames
x into z must rename each action xy into zy at the CCS end, thus violating the
finiteness of dom(σ′).

My encoding also violates Gorla’s compositionality requirement, on grounds
that $\mathcal{T}(P)$ appears multiple times (actually, infinitely many) in the translation
of Mx(y).P. It is however compositional by the definition in [10] and elsewhere.
My encoding satisfies all other criteria of [13] (operational correspondence,
divergence reflection and success sensitiveness).


Abstract. We introduce Concurrent NetKAT (CNetKAT), an extension
of NetKAT with operators for specifying and reasoning about concur- 
rency in scenarios where multiple packets interact through state. We 
provide a model of the language based on partially-ordered multisets 
(pomsets), which are a well-established mathematical structure for defin- 
ing the denotational semantics of concurrent languages. We provide a 
sound and complete axiomatization of this model, and we illustrate the 
use of CNetKAT through examples. More generally, CNetKAT can be un- 
derstood as an algebraic framework for reasoning about programs with 
both local state (in packets) and global state (in a global store). 


Keywords: Concurrent Kleene algebra, NetKAT, completeness, concurrency


1 Introduction 


Kleene algebra (KA) is a well-studied formalism [20,23,34,8] for analyzing and 
verifying imperative programs. Over the past few decades, various extensions of 
KA have been proposed for modeling increasingly sophisticated scenarios. For 
example, Kleene algebra with tests (KAT) [21] models conditional control flow 
while NetKAT [3,10] models behaviors in packet-switched networks. 

A key limitation of NetKAT, however, is that the language is stateless and 
sequential. It cannot model programs composed in parallel, and it offers no way 
to reason algebraically about the effects induced by multiple concurrent pack- 
ets. Meanwhile, the software-defined networking (SDN) paradigm has evolved to 
include richer functionality based on stateful processing including data aggrega- 
tion and dynamic routing. In languages like P4 [4], issues of concurrency arise 
because the semantics depends on the order that packets are processed. 

Given this context, it is natural to wonder whether we can add concurrency to NetKAT
while retaining the elegance of the underlying framework. In this paper, we an- 
swer this question in the affirmative, by developing CNetKAT. However, to do 
this, we must overcome several challenges. A first hurdle is that networks ex- 
hibit many different forms of concurrent behavior. The most obvious source 
of concurrency arises when multiple packets are processed by different devices.
In these situations, certain packets may cause changes in forwarding behavior 
by modifying global state variables on switches. However, there is also concur- 
rency within individual devices: a high-speed switching chip often has multiple 
pipelines, each with multiple stages of match-action tables and stateful registers. 
The tables can be programmed to act concurrently on (parts of) a single packet, 
and the pipelines also act concurrently on multiple packets. 

Another hurdle is that it is not entirely clear how to simultaneously extend 
KA with networking features and concurrency. Orthogonal to the development of 
NetKAT, the issue of adding concurrency to KA has been researched extensively, 
starting with concurrent Kleene algebra (CKA) [13,25,26,17]. However, the com- 
bination of concurrency from CKA and tests from KAT is not straightforward— 
see, e.g. [14,15,16]—which motivated the development of partially-observable 
concurrent Kleene algebra (POCKA) [37]. In POCKA, a single thread only has 
a partial view of the state. Hence, when evaluating control guards, a thread makes
observations about the machine state, rather than definitive tests. This allows for 
fine-grained reasoning about concurrent programs with variables, conditionals, 
loops, and imperative statements that manipulate a shared global memory. 

In this work, we use POCKA as a basis for designing a language with state 
and concurrent threads, which we combine with a multi-packet extension of 
NetKAT. The resulting language, Concurrent NetKAT (CNetKAT), models the 
behavior of packets in a network that communicate through a shared global 
state, and addresses the fundamental and non-trivial question of how to combine 
concurrency and the interaction between local and global state within KA. 

Overall, the contributions of the paper are as follows: 


1. We present the design of the CNetKAT language (§3). The semantics com- 
bines the language models of NetKAT and POCKA, incorporating pomsets 
that record the evolution of the global state (as in POCKA) as well as sets 
of (output) packets (as in NetKAT). 

2. We develop a sound and complete axiomatization of CNetKAT (§4). 

3. We illustrate the applicability of CNetKAT for modeling and analyzing con- 
current network behaviors through case studies and examples (§2 and §5). 


The next section contains an overview of the challenges in the design of 
extending NetKAT with multiple packets, global state, and concurrency, as well 
as a glimpse of how to use the language in a practical example. 


2 Overview 


CNetKAT models the behavior of two basic entities: the packets being routed
through the network, and a global store, which may be accessed by the network as
it processes the packets. These elements give rise to two kinds of basic programs.
On the one hand, basic packet programs, imported from NetKAT [3], include
tests (f_i=n) and modifications (f_i←n) of packet fields f_1, ..., f_k. Examples of
fields are sw, denoting the switch of the packet in the network, and tag, denoting
the type of a packet. In general, we expect packets to have fields for a collection
of standard attributes; unused fields may be populated with a dummy value.
On the other hand, basic state programs include observations⁴ (v_i=n),
modifications (v_i←n) and a copy operation (v_i←v_j) on state variables v_1, ..., v_m.
It will always be clear from context whether an action concerns a state or field
variable. CNetKAT also includes a primitive program a for any set of packets a,
which is useful for specifying the set of packets currently being processed.


p1 ≜ sw = 1 ; ((v = 1 ; tag = ♠ ; sw←2) || (tag = ♡ ; sw←3 ; v←1))
p2 ≜ sw = 2 ; sw←4
p3 ≜ sw = 3 ; sw←4
p4 ≜ sw = 4

p ≜ v←0 ; (p1 || p2 || p3 || p4)*


Fig. 1: Running example


Remark 1. We could augment the set of primitives with features such as general 
expressions in assignments. However, to keep things simple, we will only consider 
these primitives, which are already rich enough to describe non-trivial behaviors. 




CNetKAT programs are composed using sequential composition (‘;’),
iteration (‘*’), and non-deterministic choice (‘+’), similar to NetKAT. In addition,
CNetKAT programs may use the parallel composition operator (‘||’).

The full syntax of CNetKAT is given in Figure 2. Before giving a precise 
account of the semantics, we will go over some simple example programs. 


Example 1 (Packet forwarding). Consider the network depicted on the left in
Figure 1. Similar to NetKAT, we assume packet movement and variable
assignments are instantaneous. Suppose there are two packet types: ♠ and ♡. We want
to write a program that transfers packets from node 1 to node 4 by sending ♠
via node 2, and ♡ via node 3. The program running in switch 1 could be


p1 := sw=1 ; ((tag=♠ ; sw←2) || (tag=♡ ; sw←3))


This program first filters out the packets at switch 1. Next, it launches two
parallel threads, both of which receive a copy of the incoming packets. The first
thread filters out packets of type ♠ and forwards them to switch 2, while the
second thread filters out packets of type ♡, forwarding them to switch 3.

We can write programs p2, p3 and p4 for the other switches as well, and then 
compose all of those in parallel to obtain a program for the entire network. 


Remark 2. Instant packet movement is not baked into CNetKAT, but rather a 
consequence of modeling packet location using the field sw. A more advanced 


4 Intuitively, these are tests on the state that can be understood as observing the part 
of the global state containing the variable, hence the terminology. 
model could use an additional field to mark a packet as being “in-flight” until it
reaches the next hop. Here, we opt for the simpler model. 


Example 2 (Global behavior). CNetKAT programs can read and write to a global
store, letting earlier actions on packets affect later decisions. For instance,
suppose we need ♠ packets to be forwarded only if a ♡ packet already visited switch
3. We can use a global variable v to implement this stateful behavior, writing:


sw=1 ; ((v=1 ; tag=♠ ; sw←2) || (tag=♡ ; sw←3 ; v←1))


We can program the other switches with p_i, as shown in Figure 1.


Remark 3 (Concurrency and state). Actions involving global variables are more
subtle than those that concern packet fields, due to concurrent threads accessing
the global store. For instance, we can write the program v←1 ; v=2, which
first sets v to 1 and then asserts that v should have value 2. This may seem
inconsistent; however, there may be valid ways of executing this program if there
are other threads that change the value of v from 1 to 2 between the assignment
v←1 and the assertion v=2. This possibility makes defining a compositional
semantics somewhat tricky, as we will discuss below.


Semantics of CNetKAT programs. A packet π is a record of fields f_1, ..., f_k.
We write π(sw) for the value of sw in π and π[1/sw] for the packet obtained after
updating the value of sw to 1. We denote the set of packets by Pk.

The semantics of a CNetKAT program is represented as a function that takes a
set of packets, potentially located in different nodes in the network, and returns a
set of possible behaviors that those input packets might produce. More precisely,
the semantics function has type ⟦−⟧ : 2^Pk → 2^(Pom · 2^Pk). Here, Pom is the set of
pomsets [12,11], which can be thought of as structures that record the causal
order between concurrent events (details appear in Section 3.1). An element
u·b ∈ ⟦p⟧(a) means “there is an execution of p that changes the global variables
according to u, and the set of output packets produced is b”.⁵

The semantics is defined in Figure 3. For instance, a packet filter (f=n) takes
a set of packets a and returns {1·a(f=n)}, where a(f=n) contains all packets
in a where f has value n and 1 is the pomset representing that the global state
did not change. A modification (f←n) takes a set of input packets a and returns
{1·a(f←n)}, where a(f←n) = {π[n/f] : π ∈ a}. These two basic packet
actions manipulate the local state of the program.

On the global state we have observations of the form (v=n) and modifications
(v←n), (v←v′). Each gives rise to a pair in the semantics, namely {(v=n)·a},
{(v←n)·a} and {(v←v′)·a}, in which the input set of packets a is returned as output and
the assertion or modification is recorded in the pomset.

Lastly, the primitive a ∈ 2^Pk is useful for writing specifications. This program
copies the set of packets a into the global pomset. We will see that this is useful 
for checking inclusion of certain behaviors in a program’s semantics, and in the 


5 We use the notation · to denote pairs: u·b denotes the pair (u, b).
Syntax

Values              Val ∋ n ::= 0 | 1 | 2 | ···
State Fields        Var ∋ v ::= v_1 | ··· | v_m
Packet Fields       Fld ∋ f ::= f_1 | ··· | f_k
Global State        St ∋ α, β ::= Var ⇀ Val
Packets             Pk ∋ π ::= {f_1 = n_1, ···}
Packet Sets         2^Pk ∋ a, b

State actions       Act ∋ e ::=
    v←n             Change
    v←v′            Copy

Packet predicates   B ∋ t, u ::=
    drop            False
    pass            True
    f=n             Field Test
    t ∨_B u         Disjunction
    t ∧_B u         Conjunction
    ¬t              Negation

State observations  O ∋ o ::=
    ⊥               Inconsistent
    ⊤               Neutral
    v=n             State test
    o ∨ o′          Union
    o ∧ o′          Intersection
    ō               Complement

Programs            Prg ∋ p, q ::=
    abort           Abort
    skip            Skip
    t               Packet Filter
    o               State Obs
    f←n             Packet Action
    e               State Action
    dup             Duplicate
    p+q             Choice
    p;q             Sequence
    p||q            Parallel
    p*              Iteration
    a               Packet Sets


Fig. 2: CNetKAT syntax. We highlight constructs not in NetKAT. 


proof of completeness. Formally, the behavior of a on any input set b is {a·b},
where a is the global state pomset with one node labeled by a.

To construct more complicated programs, we can combine the basic elements 
above using operators from Kleene algebra. For instance, p+ q is a program that 
represents a non-deterministic choice between p and q. Its semantics is obtained 
by taking the union of sets produced by both p and q on the input packets. We 
can also compose programs sequentially using p;q, where we first apply p to the 
input packets and then q to all sets of packets produced by p, and we compose 
the corresponding global pomsets sequentially. We can iterate a program finitely 
many times using p*. Lastly, we can combine programs with a parallel operator, 
p || q, which denotes a program that, on input a, executes both p and q on a, 
and then combines the results: the pomsets denoting the global components are 
composed in parallel, and the corresponding sets of output packets joined. 
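
The combinators described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the definition from Figure 3: pomsets are approximated by nested series-parallel terms (so the commutativity of parallel composition and the closure under interleavings are ignored), iteration is omitted, packets are modelled as frozensets of field/value pairs, and all names (`packet`, `field_test`, `assign`, `dup`, `choice`, `then`, `both`) are hypothetical.

```python
# Toy model: a program maps a frozenset of packets to a set of
# (pomset term, output packets) pairs.

ONE = ("1",)  # the empty pomset: no global-state event is recorded


def packet(**fields):
    """A packet as a hashable record of field/value pairs."""
    return frozenset(fields.items())


def seq(u, v):
    """Sequential composition of pomset terms, with ONE as unit."""
    if u == ONE:
        return v
    if v == ONE:
        return u
    return ("seq", u, v)


def par(u, v):
    """Parallel composition of pomset terms, with ONE as unit."""
    if u == ONE:
        return v
    if v == ONE:
        return u
    return ("par", u, v)


def field_test(field, value):
    """Packet filter f=n: keep the packets whose field has the given value."""
    return lambda a: {(ONE, frozenset(pk for pk in a if dict(pk).get(field) == value))}


def assign(field, value):
    """Packet modification f<-n: update the field in every packet."""
    return lambda a: {(ONE, frozenset(frozenset({**dict(pk), field: value}.items()) for pk in a))}


def dup():
    """Record the current set of packets as a single pomset node."""
    return lambda a: {(("pkts", a), a)}


def choice(p, q):
    """p + q: union of the behaviors of p and q."""
    return lambda a: p(a) | q(a)


def then(p, q):
    """p ; q: run q on every output of p and sequence the pomsets."""
    return lambda a: {(seq(u, v), c) for (u, b) in p(a) for (v, c) in q(b)}


def both(p, q):
    """p || q: run p and q on the same input and join pomsets and outputs."""
    return lambda a: {(par(u, v), b | c) for (u, b) in p(a) for (v, c) in q(a)}


# The program of Example 1, with the two packet types named "spade" and "heart".
spade = packet(sw=1, tag="spade")
heart = packet(sw=1, tag="heart")
p1 = then(field_test("sw", 1),
          both(then(field_test("tag", "spade"), assign("sw", 2)),
               then(field_test("tag", "heart"), assign("sw", 3))))
result = p1(frozenset({spade, heart}))
```

On the input containing both packets, `result` holds a single pair: the empty pomset together with the set in which the spade packet sits at switch 2 and the heart packet at switch 3.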


Remark 4 (Concurrency and state, continued). Note that statements observing 
or modifying global variables are stored in the pomsets but not executed, that 
is, we do not actually check immediately whether v is indeed 1 but rather simply 
record it. This may seem like an odd choice at first: why does the semantics not 
also keep a record of the global store? The reason is related to Remark 3. 
Consider the program q = (v=0) ; (v=1), which asserts that v has value 0, 
and then that it has value 1. In isolation, q does not have any valid behavior, 
as it sequentially executes two tests that cannot be valid without intermediate 
intervention. However, the program q || (v←1) does have valid behavior on some
⟦p⟧ : 2^Pk → 2^(Pom(St ∪ Act ∪ 2^Pk) · 2^Pk)

Semantics
⟦b⟧(a) ≜ {b·a}
⟦abort⟧(a) ≜ ∅
⟦skip⟧(a) ≜ {1·a}
⟦t⟧(a) ≜ {1·⟦t⟧_B(a)}
⟦f←n⟧(a) ≜ {1·a(f←n)}
⟦dup⟧(a) ≜ {a·a}
⟦o⟧(a) ≜ St* ⊙ ⟦o⟧_O ⊙ St* × {a}
⟦e⟧(a) ≜ St* ⊙ {e} ⊙ St* × {a}
⟦p+q⟧(a) ≜ ⟦p⟧(a) ∪ ⟦q⟧(a)
⟦p;q⟧(a) ≜ {u·v·b | u·a′ ∈ ⟦p⟧(a), v·b ∈ ⟦q⟧(a′)}
⟦p||q⟧(a) ≜ {(u || v)·(b ∪ c) | u·b ∈ ⟦p⟧(a), v·c ∈ ⟦q⟧(a)}
⟦p*⟧(a) ≜ ⋃_{n∈ℕ} ⟦p^n⟧(a)

Predicates: ⟦t⟧_B : 2^Pk → 2^Pk
⟦drop⟧_B(a) ≜ ∅
⟦pass⟧_B(a) ≜ a
⟦f=n⟧_B(a) ≜ a(f=n)
⟦t ∨_B u⟧_B(a) ≜ ⟦t⟧_B(a) ∪ ⟦u⟧_B(a)
⟦t ∧_B u⟧_B(a) ≜ ⟦t⟧_B(a) ∩ ⟦u⟧_B(a)
⟦¬t⟧_B(a) ≜ a \ ⟦t⟧_B(a)

Observations: ⟦o⟧_O ⊆ St
⟦⊥⟧_O ≜ ∅
⟦⊤⟧_O ≜ St
⟦v=n⟧_O ≜ {α ∈ St | α(v) = n}
⟦o ∨ o′⟧_O ≜ ⟦o⟧_O ∪ ⟦o′⟧_O
⟦o ∧ o′⟧_O ≜ ⟦o⟧_O ∩ ⟦o′⟧_O
⟦ō⟧_O ≜ ⋃{Z ∈ P_≤(St) | ⟦o⟧_O ∩ Z = ∅}

Filtering, updates and downwards closure (a ∈ 2^Pk, Z ⊆ St)
a(f=n) ≜ {π ∈ a | π(f) = n}
a(f←n) ≜ {π[n/f] | π ∈ a}
α ≤ β ⟺ domain(β) ⊆ domain(α) ∧ ∀x ∈ domain(β). α(x) = β(x)
Z_≤ ≜ {α | ∃β ∈ Z s.t. α ≤ β}
P_≤(St) ≜ {Z | Z ⊆ St ∧ Z = Z_≤}

Fig. 3: CNetKAT semantics. Pairs u·b in ⟦p⟧(a) indicate that the program p takes
input a and the global state change induced by p is encoded in u and constrains
the final packet set b. We overload · for sequential composition of pomsets and
pairs, while ⊙ is the usual lifting from pomsets to languages.


interleavings, namely the ones where the assignment v←1 is scheduled between
the two tests. It stands to reason that a compositional semantics of such programs 
should include traces with such local inconsistencies, as they may be explained 
by actions taken by other programs running in parallel [37]. For CNetKAT, this 
is accomplished by placing the observations and modifications in the pomset. 
This leaves us with the question of how to obtain the semantics of a program 
in isolation. We take a page from POCKA [37], which uses the set of guarded 
pomsets to filter out the pomsets sensible in isolation; details appear in §5. 
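
The role of interleavings in this remark can be illustrated with a small Python sketch. It is a deliberate simplification and not the guarded-pomset construction of POCKA: events are totally ordered traces, the store is a plain dictionary, the initial value v = 0 is an assumption, and the helper names `interleavings` and `consistent` are hypothetical.

```python
def interleavings(xs, ys):
    """Yield every interleaving of two event sequences, preserving each thread's order."""
    if not xs:
        yield tuple(ys)
        return
    if not ys:
        yield tuple(xs)
        return
    for rest in interleavings(xs[1:], ys):
        yield (xs[0],) + rest
    for rest in interleavings(xs, ys[1:]):
        yield (ys[0],) + rest


def consistent(trace, store):
    """Replay a trace on a copy of the store; every observation must hold when reached."""
    store = dict(store)
    for kind, var, val in trace:
        if kind == "obs":
            if store.get(var) != val:
                return False
        else:  # kind == "set": a modification of the global variable
            store[var] = val
    return True


# q = (v=0) ; (v=1) and a second thread performing the assignment v<-1.
q = (("obs", "v", 0), ("obs", "v", 1))
writer = (("set", "v", 1),)
init = {"v": 0}  # assumed initial store

alone = consistent(q, init)
valid = [t for t in interleavings(q, writer) if consistent(t, init)]
```

Here `alone` is `False`, since q has no valid behavior in isolation, while `valid` contains exactly one trace: the one in which the assignment is scheduled between the two observations.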


One final modification is needed to obtain the CNetKAT semantics from ⟦−⟧.
The idea is to allow interleaving between parallel threads [13]. This is
accomplished by adding to the semantics all pomsets in which events are “more
ordered” than the ones already present in ⟦−⟧. We denote this closed semantics
by ⟦−⟧↓; a precise definition is given in §3.

Recording local behavior. To apply CNetKAT to various verification tasks,
we sometimes need to take snapshots of the local state at different points. For
example, if we want to argue that ♡ packets arrived at switch 3 before ♠ packets
arrived at switch 2, we need more than the information about inputs and
outputs that have occurred so far. We therefore have to extend the language
with an operator comparable to dup in NetKAT. On input a, the semantics of
the dup operator is the set {a·a}, where the first component is a single-node
pomset labeled with the set of packets a. By recording packets inside the pomset,
information about changes to packets also contains their relation to changes to
global variables during the execution. Hence, using dup, we can infer causality
relations between local and global state changes.

The programs p1, p2, p3 and p4 used in our running example (see Figure 1)
can be instrumented with a dup on every entry to and exit from a switch. This 
encodes extra information in the semantics that can be used for reasoning about 
packet-forwarding paths as well as global state changes. 


p1 ≜ sw = 1 ; dup ; ((v = 1 ; tag = ♠ ; dup ; sw←2 ; dup)
                    || (tag = ♡ ; dup ; sw←3 ; dup ; v←1))
p2 ≜ sw = 2 ; dup ; sw←4 ; dup
p3 ≜ sw = 3 ; dup ; sw←4 ; dup
p4 ≜ sw = 4 ; dup


The overall program of the running example then becomes 


p ≜ v←0 ; (p1 || p2 || p3 || p4)*


where the global variable v is initialized to 0, and the programs p1, p2, p3, p4
are executed in parallel, performing the actions of each individual switch. The
Kleene star ensures that the packets may take multiple hops through the
network, eventually reaching their final destination (switch 4).


Remark 5. If a dup occurs in parallel to other threads, then these other parallel 
threads can only change the exact place of the dup-recording in the pomset via 
possible interleavings, but not influence its content. 


Remark 6. We model the collection of in-flight packets as a set, as opposed to 
e.g. a partially ordered set encoding their order of arrival. This is an abstrac- 
tion of our framework. Not putting an order on packets simplifies the algebraic 
presentation and has the advantage that it enables modeling of switches that 
reorder packets without an additional primitive. If the order of packets is im- 
portant, information about this order can be extracted from the semantics. In 
particular, when packets were forwarded can be deduced by inspecting the sets 
of packets recorded in the pomset component using dup. 


Two differences between CNetKAT and NetKAT. Readers familiar with
NetKAT might wonder why Example 1 uses || instead of + to compose the


6 We overload ‘a’ as a set of packets, a programming primitive and a label used in 
pomsets, but it always denotes a set of packets in the latter two uses as well. 
branches of p1. The reason is that in CNetKAT, || is interpreted as multicast
and + is interpreted as non-deterministic composition. In NetKAT, programs 
act on a single input packet, so these coincide. But in CNetKAT, programs act 
on multiple packets concurrently, so they must be distinguished. 

To illustrate the difference, consider wanting to filter the input packets so that 
only those where field f has value n or field g has value m remain. In NetKAT, 
we can use the program f=n + g=m, which can be understood in two different 
ways. First, we can think of it as using (angelic) non-determinism to select a test, 
yielding {π} if at least one test passes and ∅ if both tests fail. Alternatively, we
can think of it as using multicast to copy the input to both f=n and g=m, then 
using the tests to perform the required filtering, and finally taking the union of 
the resulting sets. In NetKAT, the net effect of both interpretations is identical, 
so multicast and non-determinism can be identified semantically. 

However, when we generalize to sets of packets, it is natural to expect that
processing a set a with the tests f=n and g=m would yield the subset of a where
each packet satisfies at least one of the tests. Operationally, processing a using
these programs could be realized by making two copies of a, then using the tests
to perform the required filtering, and taking the union of the resulting sets. This
is reflected in the semantics: ⟦f=n || g=m⟧(a) = {1·(a(f=n) ∪ a(g=m))},
where we get a single pair in the output. If instead we non-deterministically
choose between the tests, the result would be the subset where f = n or the
subset where g = m. Indeed, we have that ⟦f=n + g=m⟧(a) = {1·a(f=n),
1·a(g=m)}. Hence, multicast and non-determinism can no longer be identified in
the context of multiple packets. For readers familiar with NetKAT, this means
that the Boolean disjunction ∨ is now identified with || rather than +.
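
The difference between the two readings can be checked on a concrete set of packets. The following Python sketch is only an illustration: since both programs consist of tests, the pomset component is always 1 and is dropped, packets are modelled as frozensets of field/value pairs, the field values are made up, and the names `field_test`, `multicast` and `choice` are hypothetical.

```python
def packet(**fields):
    """A packet as a hashable record of field/value pairs."""
    return frozenset(fields.items())


def field_test(field, value):
    """f=n on a set of packets: a single outcome, the filtered set."""
    return lambda a: {frozenset(pk for pk in a if dict(pk)[field] == value)}


def multicast(p, q):
    """p || q: both branches see the input, and their outputs are joined."""
    return lambda a: {b | c for b in p(a) for c in q(a)}


def choice(p, q):
    """p + q: one branch is chosen, so the outcomes are kept apart."""
    return lambda a: p(a) | q(a)


pk1 = packet(f=1, g=0)  # passes f=1 only
pk2 = packet(f=0, g=2)  # passes g=2 only
pk3 = packet(f=0, g=0)  # passes neither test
a = frozenset({pk1, pk2, pk3})

joined = multicast(field_test("f", 1), field_test("g", 2))(a)
either = choice(field_test("f", 1), field_test("g", 2))(a)
```

With these inputs, `joined` contains the single outcome {pk1, pk2}, whereas `either` contains the two separate outcomes {pk1} and {pk2}.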

Lastly, we highlight that CNetKAT’s dup is fundamentally different from 
NetKAT’s dup, which just records versions of the packet during execution. In 
CNetKAT, dup does two things: it implements the same functionality as in 
NetKAT, but also structures the recording of packets inside the pomset. 

Proving properties with CNetKAT. In §5, we analyze the behavior of the
running example in detail and show how to filter out the behaviors of p that can 
be obtained when it is run in isolation. In this overview, we establish a simpler 
property: namely, that p exhibits executions where the packets were at switch 3 
before they were at switch 2. We first argue this using the denotational semantics 
and then illustrate how we can establish the same fact with axiomatic reasoning. 

Recall a pomset accounts for events and the ordering between them. In the
following examples, we will depict pomsets as a graph with nodes labeled by
state actions, observations and sets of packets, and the ordering indicated by
arrows. For instance, a → b means that a happened before b.

We evaluate p on input {♡,♠}, where both packets start at switch 1. In the
closed semantics ⟦p⟧↓({♡,♠}) we find the following pomset (the ··· indicate
that the pomset continues on the next line, not that nodes are omitted), in the
first projection, with β a partial function from Var to Val s.t. β(v) = 1:


(v←0) → {♡,♠} → {♡} → {♡[3/sw]} → (v←1) → β → ···

··· → {♠} → {♠[2/sw]} → ( {♠[2/sw]} → {♠[4/sw]}  ||  {♡[3/sw]} → {♡[4/sw]} ) → {♠[4/sw], ♡[4/sw]}


Every node labeled with a set of packets can be understood intuitively as
“at this point in the execution these packets were a subset of the total packets
present in the network.” We can observe in the pomset that the ♡ packet was at
switch 3 before the ♠ packet reached switch 2. We also see that v←1 happens
between v←0 and β. In the end, both packets are observed at switch 4.

The second projection in the semantics corresponding to this pomset is the
set of output packets {♠[4/sw], ♡[4/sw]}.

In the full version of this article [38, Appendix E], we show something
stronger: in all behaviors that can happen in isolation, the packet ♡[3/sw] is
recorded into the global pomset before the assignment v←1, which precedes the
observation that v equals 1 and the generation of the packet ♠[2/sw].

We can write an axiomatic statement that captures that the above behavior
is in the closed semantics of $p$ on input $\{\heartsuit, \spadesuit\}$. To do this, we first need to
capture the pictured global state pomset, together with its set of output packets,
syntactically, for which we use an abbreviation. Namely, we can write a program
that outputs, on any input, a specific packet: for a packet $\pi$, we write this
program simply as $\pi$. The output of $\llbracket \pi \rrbracket$ on any input is $\{1 \cdot \{\pi\}\}$. This extends
to sets of packets: $\heartsuit \parallel \spadesuit$ denotes a program whose semantics is $\{1 \cdot \{\heartsuit, \spadesuit\}\}$
on any input. This notation pairs well with the use of the letters $a \in 2^{\mathsf{Pk}}$ as
programming syntax: if we know which set of packets we (want to) record into
the global state pomset with dup, we can also directly write this set of packets
in the program as a syntactic letter. For instance, the program $(\heartsuit \parallel \spadesuit) ; \mathsf{dup}$
has the same behaviors as $(\heartsuit \parallel \spadesuit) ; \{\heartsuit, \spadesuit\}$: the moment we execute the dup, we
know the current set of packets is $\{\heartsuit, \spadesuit\}$, and thus writing this set of packets
as a letter and recording that letter into the global state pomset will have the
same result. Using these two pieces of information, we can write the program


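\[
\begin{aligned}
q \triangleq \big(& (v \leftarrow 0) ; \{\heartsuit, \spadesuit\} ; \{\heartsuit\} ; \{\heartsuit[3/sw]\} ; (v \leftarrow 1) ; (v = 1) ; \{\spadesuit\} ; \{\spadesuit[2/sw]\} ; {} \\
& \big((\{\spadesuit[2/sw]\} ; \{\spadesuit[4/sw]\}) \parallel (\{\heartsuit[3/sw]\} ; \{\heartsuit[4/sw]\})\big) ; {} \\
& \{\spadesuit[4/sw], \heartsuit[4/sw]\}\big) ; (\spadesuit[4/sw] \parallel \heartsuit[4/sw])
\end{aligned}
\tag{1}
\]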


The first chunk of this program is the syntactic encoding of the desired global 
state pomset, where the $\heartsuit$ packet arrives at switch 3 before the $\spadesuit$ packet arrives
at switch 2, and the final parallel of packets represents the set of output packets. 
We can prove using the axioms of CNetKAT that 


\[
(\heartsuit \parallel \spadesuit) ; q \leq (\heartsuit \parallel \spadesuit) ; p \tag{2}
\]


(2) states that the behavior of $q$ on input $\{\heartsuit, \spadesuit\}$ is included in the behavior
of $p$ on the same input. In the behavior of $q$, it is clear that the $\heartsuit$ packets are
observed at switch 3 before the $\spadesuit$ packets appear at switch 2.

Remark 7 (Generalized alphabet). Here we see the use of sets of packets as
letters in the program syntax. Program $q$ is much closer to the behavior we try
to capture, and therefore easier to analyze, than a program containing dup.


To check the validity of equivalences such as (2), we axiomatize CNetKAT and
prove the axiomatization sound and complete. The axioms include the axioms of KA, extended
with additional axioms for operations that manipulate packets and the global
state. The full axiomatization appears in Section 3.4. For instance, $\mathsf{drop} ; q \equiv \mathsf{drop}$
states that no outputs are produced in the absence of inputs. The program
drop drops the set of inputs and returns $\{1 \cdot \emptyset\}$. Any program $q$ after drop
outputs $\{1 \cdot \emptyset\}$, because $q$ is not executed when the input is empty. In contrast,
$q ; \mathsf{drop} \equiv \mathsf{drop}$ does not hold, since $q$ might have changed the global state.

In addition to drop, CNetKAT has a program abort, which acts as a unit for
non-deterministic choice ($+$). To illustrate the difference between abort and drop,
consider $(f{=}n) ; (f{=}m)$ and $(v{=}n) \wedge (v{=}m)$, where $m \neq n$. The first program
filters using $f = n$ and then filters using $f = m$. This yields
$\{1 \cdot \emptyset\}$, since a packet cannot have two different values for $f$. Hence, we can derive
$(f{=}n) ; (f{=}m) \equiv \mathsf{drop}$. The second program asserts that the global state variable
$v$ has value $n$ and value $m$, which is inconsistent: it requires $v$ to have two
different values at the same time. Hence, from the axioms we can derive that
$(v{=}n) \wedge (v{=}m) \equiv \bot \equiv \mathsf{abort}$.
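The asymmetry between drop and abort can be checked on a toy model. The sketch below is only an approximation of the semantics described above: it assumes that packets are plain integer ids, that a linear tuple of event names stands in for a pomset, and that the padding with arbitrary states is omitted. All names (`lift`, `seq`, `choice`, `assign`) are hypothetical helpers, not part of CNetKAT.

```python
from typing import Callable, FrozenSet, Set, Tuple

Packets = FrozenSet[int]                # packets abstracted to integer ids (assumption)
Trace = Tuple[str, ...]                 # a linear trace standing in for a pomset
Behaviors = Set[Tuple[Trace, Packets]]  # a set of pairs "u . b"
Program = Callable[[Packets], Behaviors]

NO_PACKETS: Packets = frozenset()


def lift(body: Program) -> Program:
    """Wrap a program so that the empty input always yields {1 . {}}."""
    return lambda a: body(a) if a else {((), NO_PACKETS)}


drop: Program = lift(lambda a: {((), NO_PACKETS)})  # forget all packets
abort: Program = lift(lambda a: set())              # no behavior at all


def assign(var: str, val: int) -> Program:
    """State modification: record the assignment, keep the packets."""
    return lift(lambda a: {((f"{var}<-{val}",), a)})


def seq(p: Program, q: Program) -> Program:
    """Run q on each output set of p and concatenate the traces."""
    return lambda a: {(u + w, c) for (u, b) in p(a) for (w, c) in q(b)}


def choice(p: Program, q: Program) -> Program:
    """Non-deterministic choice: union of behaviors."""
    return lambda a: p(a) | q(a)


packets = frozenset({1, 2})
q = assign("v", 1)
print(seq(drop, q)(packets) == drop(packets))    # compares drop ; q with drop
print(seq(q, drop)(packets) == drop(packets))    # compares q ; drop with drop
print(choice(q, abort)(packets) == q(packets))   # compares q + abort with q
```

In this model the assignment performed by `q` stays visible in the trace of `q ; drop`, which is why the second comparison fails while the first succeeds.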

We prove in §4 that the axiomatization presented in Section 3.4 is not only
sound but also complete, i.e., all programs with the same semantics can be
proved equivalent using the axioms. The rest of the paper is devoted to presenting
the CNetKAT syntax and semantics formally (§3), and establishing conservativity
results over NetKAT and POCKA. Lastly we present a case study (§5).


3 Concurrent NetKAT 


This section defines the syntax and semantics of CNetKAT formally. 


3.1 Pomsets and pomset languages 


For a poset $(X, \leq)$ and a set $S \subseteq X$, define the downwards-closure of $S$ by
$S_{\leq} := \{x \mid \exists y \in S \text{ s.t. } x \leq y\}$ and $\mathcal{P}_{\leq}(X) := \{Y \subseteq X \mid Y = Y_{\leq}\}$. It is
well-known that $\mathcal{P}_{\leq}(X)$ carries the structure of a bounded distributive lattice,
with intersection as meet, union as join, $X$ as top and $\emptyset$ as bottom. Further, if
$(X, \leq)$ is finite, the lattice is itself finite and thus carries a (necessarily unique)
pseudocomplement defined by $\overline{Y} := \bigcup \{Z \in \mathcal{P}_{\leq}(X) \mid Y \cap Z = \emptyset\}$. We provide a
concrete lattice with a pseudocomplement below.


Pomsets are used to capture the different evolutions of the state as it is accessed
concurrently by different threads. Pomsets are labeled posets (up to isomorphism),
used as a generalization of words [11,12]. A labeled poset over a finite
alphabet $\Sigma$ is a triple $u = (S_u, \leq_u, \lambda_u)$, where $(S_u, \leq_u)$ is a partially ordered
set and $\lambda_u : S_u \to \Sigma$ is the labeling function. For labeled posets $u, v$, we say $u$
is isomorphic to $v$, written $u \cong v$, if there exists a bijection $h : S_u \to S_v$ that preserves
labels ($\lambda_v \circ h = \lambda_u$) and preserves and reflects the ordering ($s \leq_u s'$ if and only
if $h(s) \leq_v h(s')$). A pomset over $\Sigma$ is an isomorphism class of labeled posets over
$\Sigma$, i.e., the class $[v] = \{u \mid u \cong v\}$ for some labeled poset $v$. Because pomsets are
label-preserving isomorphism classes, the nature of the carrier is not relevant,
only its cardinality and order. The triple $u = (S_u, \leq_u, \lambda_u)$ is a representation of
the pomset. However, we often abuse terminology and call $u$ the pomset.

We write $\mathsf{Pom}(\Sigma)$ for the set of pomsets over $\Sigma$, and $1$ for the empty pomset.
When $a \in \Sigma$, we write $a$ for the pomset represented by the labeled poset with a
single node labeled by $a$. Pomsets can be composed sequentially and in parallel.

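The parallel composition of two pomsets is obtained by taking the disjoint
union of the carriers, while keeping the ordering relations within each component.
Formally, $u \parallel v = (S_{u \parallel v}, \leq_{u \parallel v}, \lambda_{u \parallel v})$ with $S_{u \parallel v} = S_u + S_v$, ${\leq_{u \parallel v}} = {\leq_u} \cup {\leq_v}$,
$\lambda_{u \parallel v}(x) = \lambda_u(x)$ for $x \in S_u$, and $\lambda_{u \parallel v}(x) = \lambda_v(x)$ for $x \in S_v$. Two
pomsets are composed sequentially by taking the disjoint union of the carriers
and ordering all elements of the first before all elements of the second, keeping the
ordering relations within each component. Formally, $u \cdot v = (S_{u \cdot v}, \leq_{u \cdot v}, \lambda_{u \cdot v})$,
with $S_{u \cdot v} = S_u + S_v$, ${\leq_{u \cdot v}} = {\leq_u} \cup {\leq_v} \cup (S_u \times S_v)$ and $\lambda_{u \cdot v} = \lambda_{u \parallel v}$.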
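The two compositions can be sketched on a concrete representative of a pomset. The encoding below is an assumption made for illustration: nodes are the indices `0..n-1`, the disjoint union is realized by shifting the indices of the second operand, and only the strict part of the order is stored. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class LabeledPoset:
    labels: Tuple[str, ...]              # labels[i] is the label of node i
    order: FrozenSet[Tuple[int, int]]    # strict pairs (i, j): node i is before node j


def atom(label: str) -> LabeledPoset:
    """The pomset with a single node carrying the given label."""
    return LabeledPoset((label,), frozenset())


def par(u: LabeledPoset, v: LabeledPoset) -> LabeledPoset:
    """Parallel composition: disjoint union of carriers, no new ordering."""
    n = len(u.labels)
    shifted = {(i + n, j + n) for (i, j) in v.order}
    return LabeledPoset(u.labels + v.labels, frozenset(u.order | shifted))


def seq(u: LabeledPoset, v: LabeledPoset) -> LabeledPoset:
    """Sequential composition: additionally order every node of u before every node of v."""
    n, m = len(u.labels), len(v.labels)
    both = par(u, v)
    cross = {(i, n + j) for i in range(n) for j in range(m)}
    return LabeledPoset(both.labels, frozenset(both.order | cross))


a, b, c = atom("a"), atom("b"), atom("c")
join = seq(par(a, b), c)   # a and b unordered, both before c
print(join.labels, sorted(join.order))
```

Because the stored order is kept transitive by construction, no separate closure step is needed after composing.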

Gischer introduced a notion of ordering on pomsets [11]: $u \sqsubseteq v$ means that
$u, v$ have the same events and labels, but $u$ is “more sequential” than $v$ in the
sense that more events are ordered. Formally, $u \sqsubseteq v$ if there exists a label- and
order-preserving bijection $h : S_v \to S_u$.
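For very small pomsets, subsumption can be tested by brute force over all bijections. The following is a sketch under assumptions: a pomset representative is a pair of a label tuple and a set of strict order pairs over node indices, and the function name `subsumed` is hypothetical. The search is factorial in the number of nodes, so it is only meant to illustrate the definition.

```python
from itertools import permutations


def subsumed(u, v):
    """Return True if u is subsumed by v, i.e. a label- and order-preserving
    bijection from the nodes of v to the nodes of u exists."""
    labels_u, order_u = u
    labels_v, order_v = v
    if sorted(labels_u) != sorted(labels_v):
        return False
    n = len(labels_v)
    for h in permutations(range(n)):      # h[i] is the node of u that node i of v maps to
        keeps_labels = all(labels_v[i] == labels_u[h[i]] for i in range(n))
        keeps_order = all((h[i], h[j]) in order_u for (i, j) in order_v)
        if keeps_labels and keeps_order:
            return True
    return False


ab_seq = (("a", "b"), {(0, 1)})   # a before b
ba_seq = (("b", "a"), {(0, 1)})   # b before a
ab_par = (("a", "b"), set())      # a and b unordered

print(subsumed(ab_seq, ab_par), subsumed(ba_seq, ab_par))  # both interleavings vs. the parallel pomset
print(subsumed(ab_par, ab_seq))                            # the converse direction
```

Both interleavings of `a` and `b` are more sequential than the parallel pomset, which is the situation that closure under this order is meant to cover.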

Pomset languages are simply sets of pomsets. The operations on pomsets 
lift pointwise to pomset languages, see Figure 3. The semantics of concurrent 
threads requires ensuring a closure property. In particular, we will close pomset 
languages under the subsumption order of Gischer. Additionally, for pomsets 
that contain nodes labeled by observations, we make use of a contraction order: 
$u \preceq v$, capturing that $u$ results from $v$ by eliminating consecutive observations
that can be collapsed into one. As an example, consider 


[Figure: two example pomsets with nodes labeled by $\alpha$]


Denote these pomsets by $u$ and $v$ respectively, and let $\alpha \in \mathsf{St}$. Then $u \preceq v$. A
formal definition can be found in the full version of this article [38, Appendix A].


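Definition 1 (Closure). Let $L$ be a pomset language.

\[
L^{\mathsf{exch}} = \{u \mid \exists v \in L \text{ s.t. } u \sqsubseteq v\}
\qquad
L^{\mathsf{contr}} = \{u \mid \exists v \in L \text{ s.t. } u \preceq v\}
\]

We define $L{\downarrow}$ as the smallest language containing $L$ and satisfying that
if $v \in L{\downarrow}$ and $u \preceq v$ or $u \sqsubseteq v$, then $u \in L{\downarrow}$.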


Closure under $\sqsubseteq$ is called exch because it ensures soundness of the exchange
law, an axiom introduced in [13] to capture the possibility of interleaving. Closure
under contraction is motivated algebraically; it ensures soundness of one of the
axioms necessary when adding a test algebra (a PCDL or a BA) to a KA [16].

3.2 CNetKAT: syntax and semantics


CNetKAT expressions denote (possibly concurrent) packet processing programs 
that have access to a global state. Syntactically, CNetKAT is a language built 
from alphabets of tests and actions, each of which is divided in two categories. 
For packet tests, we first inherit NetKAT’s packet predicates, which are elements
of a Boolean algebra generated by an alphabet of basic tests on packet fields.
Packet predicates $t, u$ include constants drop and pass, denoting false and true,
basic tests $f{=}n$, negation $\neg t$, disjunction $t \vee_B u$ and conjunction $t \wedge_B u$.

Additionally, we have state observations, which do not have the structure of 
a Boolean algebra but instead form a pseudocomplemented distributive lattice. 
Intuitively, the functions denoting the state are partial. State observations $o, o'$
include constants $\bot$ and $\top$, basic tests $v{=}n$, pseudocomplement $\overline{o}$, intersection
$o \wedge o'$ and union $o \vee o'$. The other constructs were introduced in §2 (see Figure 2).

The semantics of a program $p$ is a function $\llbracket p \rrbracket : 2^{\mathsf{Pk}} \to 2^{\mathsf{Pom}(\mathsf{St} \cup \mathsf{Act} \cup 2^{\mathsf{Pk}}) \cdot 2^{\mathsf{Pk}}}$ that
takes a set of packets $a$ and produces a (possibly empty) set of pairs $u \cdot b$ consisting
of a pomset $u$, recording the global state behavior and the storage of local packets
whenever dup is used, and a set of packets $b$. On an empty input set, every
program produces $\{1 \cdot \emptyset\}$, modeling that nothing can happen without packets.
Producing the empty set when the input is non-empty models a program that has
aborted, whereas producing the set $\{1 \cdot \emptyset\}$ models dropping all the packets without
any change to the state. Most of the semantics was already explained in §2; in
the following we elaborate on some behaviors and illustrate subtleties concerning
the units. See Figure 3 for an overview of the full denotational semantics of
CNetKAT.

On a non-empty input $a$, a packet filter $t$ removes packets in $a$ that do
not satisfy predicate $t$ and does not touch the state. This is captured by the
set $\{1 \cdot \llbracket t \rrbracket_B(a)\}$, where $\llbracket t \rrbracket_B(a)$ is interpreted as an element of the Boolean
algebra $(2^a, \cup, \cap, \emptyset, a, \setminus)$ defined by the poset $(2^a, \subseteq)$, and $\llbracket t \rrbracket_B(a)$ is defined as
the homomorphic extension of $\llbracket f{=}n \rrbracket_B(a) = \{\pi \in a \mid \pi(f) = n\}$.
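The interpretation of packet predicates in the Boolean algebra of subsets of the input can be sketched as a small recursive evaluator. The encoding is an assumption for illustration: packets are frozensets of (field, value) pairs, predicates are nested tuples, and the names `pk`, `eval_pred` and `filter_semantics` are hypothetical.

```python
from typing import FrozenSet, Tuple

Packet = FrozenSet[Tuple[str, int]]  # a packet as a set of (field, value) pairs


def pk(**fields: int) -> Packet:
    """Build a packet from keyword arguments, e.g. pk(sw=1, dst=7)."""
    return frozenset(fields.items())


def eval_pred(t: tuple, a: FrozenSet[Packet]) -> FrozenSet[Packet]:
    """Interpret predicate t inside the Boolean algebra of subsets of a."""
    tag = t[0]
    if tag == "pass":
        return a
    if tag == "drop":
        return frozenset()
    if tag == "test":                      # basic test f = n
        _, field, value = t
        return frozenset(p for p in a if dict(p).get(field) == value)
    if tag == "not":                       # complement relative to a
        return a - eval_pred(t[1], a)
    if tag == "or":
        return eval_pred(t[1], a) | eval_pred(t[2], a)
    if tag == "and":
        return eval_pred(t[1], a) & eval_pred(t[2], a)
    raise ValueError(f"unknown predicate: {tag}")


def filter_semantics(t: tuple, a: FrozenSet[Packet]):
    """The single pair 1 . [[t]]_B(a); the empty tuple stands for the empty pomset."""
    return {((), eval_pred(t, a))}


inputs = frozenset({pk(sw=1, dst=7), pk(sw=2, dst=7)})
print(filter_semantics(("test", "sw", 1), inputs))
print(filter_semantics(("and", ("test", "sw", 1), ("test", "sw", 2)), inputs))
```

Since every case selects packets pointwise, evaluating a predicate on a union of two inputs gives the union of the two results in this model.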

A state observation denotes a function that returns a set with elements $u \cdot a$
when applied to a set $a$. In case the original input set $a$ is empty, nothing happens
and the output of $\llbracket o \rrbracket(a)$ is simply $\{1 \cdot \emptyset\}$. When $a$ is not empty, the semantics of $o$
makes use of an observation algebra developed in [14,37]. More formally, we take
the pseudocomplemented bounded distributive lattice $(\mathcal{P}_{\leq}(\mathsf{St}), \cup, \cap, \mathsf{St}, \emptyset, \overline{\,\cdot\,})$
generated by the poset $(\mathsf{St}, \leq)$ with $\alpha \leq \beta$ if and only if $\mathrm{domain}(\beta) \subseteq \mathrm{domain}(\alpha)$
and $\forall x \in \mathrm{domain}(\beta).\ \alpha(x) = \beta(x)$. Then, a state observation is interpreted as
$\mathsf{St}^* \cdot \llbracket o \rrbracket_O \cdot \mathsf{St}^* \times \{a\}$, where $\llbracket o \rrbracket_O$ is an element of $\mathcal{P}_{\leq}(\mathsf{St})$, defined as the
homomorphic extension of the assignment $\llbracket v{=}n \rrbracket_O = \{\alpha \in \mathsf{St} \mid \alpha(v) = n\}$. Intuitively,
in $\llbracket o \rrbracket_O$ we find all the partial functions (elements of $\mathsf{St}$) that agree with
$o$. For instance, $\llbracket v{=}n \rrbracket_O$ contains all partial functions that assign $n$ to $v$. This
also illustrates the need for a pseudocomplement rather than a complement: if
threads have only partial information about the state, an observation should be
satisfied only if there is positive evidence for it. Hence, e.g., $\overline{v{=}n}$ should be satisfied
only if $v$ has a value and it is not $n$, which is not captured by the complement
from a Boolean algebra: the complement would also include partial functions
that do not assign a value to $v$ in the behavior of $\overline{v{=}n}$. This is incorrect, because
if $v$ has no value in a partial observation, we might learn later that the actual
value of $v$ was in fact $n$, and it was therefore incorrect to assert $\overline{v{=}n}$.
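The difference between the pseudocomplement and a Boolean complement can be made concrete on a tiny state space. This is a sketch under assumptions: two variables `v`, `w` over the values 0 and 1 (both invented for the example), partial states encoded as frozensets of (variable, value) pairs, and hypothetical helper names. With this encoding, one state is below another exactly when it extends it.

```python
from itertools import product

VARS = ("v", "w")   # assumed variable names
VALS = (0, 1)       # assumed value domain


def all_states():
    """All partial functions Var -> Val, each encoded as a frozenset of (var, val) pairs."""
    result = set()
    for choice in product((None,) + VALS, repeat=len(VARS)):
        result.add(frozenset((x, n) for x, n in zip(VARS, choice) if n is not None))
    return frozenset(result)


ST = all_states()


def below(beta):
    """States alpha with alpha <= beta: alpha agrees with beta and may define more."""
    return frozenset(alpha for alpha in ST if beta <= alpha)


def obs(var, val):
    """The partial states that assign val to var (the meaning of the test var = val)."""
    return frozenset(alpha for alpha in ST if (var, val) in alpha)


def pseudocomplement(y):
    """Largest downward-closed set disjoint from y: states none of whose extensions lie in y."""
    return frozenset(beta for beta in ST if not (below(beta) & y))


unknown = frozenset()                    # the state that assigns nothing
not_v0 = pseudocomplement(obs("v", 0))
print(unknown in ST - obs("v", 0))       # membership in the Boolean complement
print(unknown in not_v0)                 # membership in the pseudocomplement
```

The state that knows nothing about `v` belongs to the Boolean complement of the observation but not to its pseudocomplement, matching the positive-evidence reading described above.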

State modifications are interpreted as a set of elements $u \cdot a$ when applied to a
set $a$. The pomsets $u$ record the state modification surrounded by arbitrary state
observations; in the first projection of the semantics of the assignment $v \leftarrow n$ we
get a set of possible pomsets: $\mathsf{St}^* \cdot \{v \leftarrow n\} \cdot \mathsf{St}^*$.


Remark 8. We surround state changes and observations with arbitrary sequences 
of states to include global pomsets that have alternating modifications and states 
in the semantics. Reasoning about behavior of programs is more practical using 
such alternating pomsets, because the states allow one to take stock of the 
configuration of the machine in between modifications. The semantics contains 
also non-alternating pomsets to ensure compositionality w.r.t the parallel. 


CNetKAT has six different syntactical units, some of which coincide semantically.
There are two units for packets: drop, which drops all the packets ($\{1 \cdot \emptyset\}$),
and pass, which passes the current packets without changing the state ($\{1 \cdot a\}$ on
input $a$). Similarly, we have two units for state observations: $\bot$ and $\top$. The first
one indicates an inconsistent state, and therefore the whole program exhibits no
behavior; its behavior is $\emptyset$. The second one indicates any state observation is
acceptable, and its behavior on input $a$ is $\{s \cdot a \mid s \in \mathsf{St}\}$. Lastly there are two
units for programs in general: abort, the program without behavior, and skip,
the program where nothing happens (on input $a$ its semantics is $\{1 \cdot a\}$). Hence,
abort is equivalent to $\bot$ and skip equivalent to pass. All units behave as $\{1 \cdot \emptyset\}$
when the input set is $\emptyset$, because nothing happens when there are no packets.

The CNetKAT semantics consists of pairs of global state pomsets and sets 
of output packets. It might be possible to encode the information of the output 
packets as a final node in the pomset, but keeping the set of output packets 
separated allows us to easily track the input-output behavior of a program in 
terms of packets. This brings CNetKAT closer to NetKAT and its packet processing
behavior. In particular, the NetKAT packet processing axioms can only be
used because we track the input-output behavior of the program separately.

To obtain the full semantics, and ensure we capture correctly the intended 
behavior, we need to perform a closure on the state component. 


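Definition 2 (Closed Semantics). Given a CNetKAT policy $p$, we define the
semantics of $p$ when applied to input $a \in 2^{\mathsf{Pk}}$ as

\[
\llbracket p \rrbracket{\downarrow}(a) = \{u \cdot b \mid v \cdot b \in \llbracket p \rrbracket(a),\ u \in \{v\}{\downarrow}\}
\]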


Closure under exch and contr formalizes important intuitions about the se- 
mantics of concurrent threads. The closure under exch ensures all traces resulting 
from interleaving threads are included, and the closure under contr specifies that 
if two observations hold simultaneously, then it is possible to observe them in 
sequence. Note that the converse should not hold as some action could happen 
in between the two observations in a parallel thread. 

We distinguish state, packet and deterministic packet programs as follows.

Definition 3 (State and deterministic packet programs). Let $\mathcal{T}_{\mathrm{packet}}$ denote
packet programs, which are programs generated by the following grammar:

\[
p, q ::= t \in B \cup \{f \leftarrow n \mid f \in \mathsf{Fld}, n \in \mathsf{Val}\} \mid p + q \mid p ; q \mid p \parallel q \mid p^*
\]

Let $\mathcal{T}_{\mathrm{state}}(\Sigma)$ denote state programs over alphabet $\Sigma$:

\[
s, v ::= \mathsf{abort} \mid \mathsf{skip} \mid u \in \Sigma \mid s + v \mid s ; v \mid s \parallel v \mid s^*
\]

Let $\mathcal{T}_{\mathrm{det\text{-}pack}}$ denote deterministic packet programs:$^{1}$

\[
x, y ::= t \in B \cup \{f \leftarrow n \mid f \in \mathsf{Fld}, n \in \mathsf{Val}\} \mid x ; y \mid x \parallel y
\]

In this paper we mostly use state programs over alphabet $O \cup \mathsf{Act} \cup 2^{\mathsf{Pk}} \cup \{\mathsf{dup}\}$.
Whenever we intend to use this alphabet, we simply write $\mathcal{T}_{\mathrm{state}}$.

We prove the following lemmas regarding the CNetKAT semantics.


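Lemma 1 (State and packet program semantics). Let $p \in \mathcal{T}_{\mathrm{packet}}$, $s \in \mathcal{T}_{\mathrm{state}}$
and $a \in 2^{\mathsf{Pk}}$. For all $w \in \llbracket p \rrbracket(a)$, $w$ is of the form $1 \cdot b$ for $b \in 2^{\mathsf{Pk}}$. For all
$w \in \llbracket s \rrbracket(a)$, $w$ is of the form $v \cdot a$ for $v$ a pomset over $\mathsf{St} \cup \mathsf{Act} \cup 2^{\mathsf{Pk}}$.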


For non-empty sets of packets $a$ and $a'$, the global behavior of a state program
without dup is identical on both inputs. Let $2^{\mathsf{Pk}}_{\neq\emptyset}$ denote $2^{\mathsf{Pk}} \setminus \{\emptyset\}$.


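Lemma 2. Let $s \in \mathcal{T}_{\mathrm{state}}(O \cup \mathsf{Act} \cup 2^{\mathsf{Pk}})$. For all $a, a' \in 2^{\mathsf{Pk}}_{\neq\emptyset}$ we have

\[
\{u \mid u \cdot b \in \llbracket s \rrbracket(a)\} = \{u \mid u \cdot b \in \llbracket s \rrbracket(a')\}.
\]

We characterize $\llbracket - \rrbracket_B$ in terms of its behavior on subsets of the input set.

Lemma 3. Let $t \in B$ and $a, b \subseteq \mathsf{Pk}$. Then $\llbracket t \rrbracket_B(a \cup b) = \llbracket t \rrbracket_B(a) \cup \llbracket t \rrbracket_B(b)$.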


Lastly, we have a lemma characterising the semantics of a deterministic 
packet program in terms of its behavior on subsets of the input. 


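Lemma 4. Let $x \in \mathcal{T}_{\mathrm{det\text{-}pack}}$ and $a, b \subseteq \mathsf{Pk}$. Then

\[
\llbracket x \rrbracket(a \cup b) = \{1 \cdot (c \cup d) \mid \llbracket x \rrbracket(a) = \{1 \cdot c\},\ \llbracket x \rrbracket(b) = \{1 \cdot d\}\}.
\]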


3.3 Is CNetKAT conservative over NetKAT and POCKA? 


CNetKAT combines NetKAT and POCKA, so it is natural to ask whether it is a 
conservative extension of either language. It turns out that the answer is positive 
for POCKA, and for a fragment of NetKAT. We start by recalling the semantics 
of NetKAT [3]. Note that NetKAT expressions are packet programs without ||. 


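$^{1}$ Equivalently, we can define $\mathcal{T}_{\mathrm{det\text{-}pack}}$ by adding a predicate $H$ to the signature of our
algebra that counts the number of $*$’s and $+$’s a term contains; a packet program
$p$ is then an element of $\mathcal{T}_{\mathrm{det\text{-}pack}}$ if and only if $p \in \mathcal{T}_{\mathrm{packet}}$ and $H(p) = 0$.

Definition 4 (NetKAT semantics). Let $\pi \in \mathsf{Pk}$, $t \in B$ and $p, q$ NetKAT terms.

\[
\begin{array}{ll}
\llbracket t \rrbracket_{NK}(\pi) = \llbracket t \rrbracket_B(\{\pi\}) & \llbracket \mathsf{pass} \rrbracket_{NK}(\pi) = \{\pi\} \qquad \llbracket \mathsf{drop} \rrbracket_{NK}(\pi) = \emptyset \\[4pt]
\llbracket f \leftarrow n \rrbracket_{NK}(\pi) = \{\pi[n/f]\} & \llbracket p ; q \rrbracket_{NK}(\pi) = \bigcup_{\pi' \in \llbracket p \rrbracket_{NK}(\pi)} \llbracket q \rrbracket_{NK}(\pi') \\[4pt]
\llbracket p^* \rrbracket_{NK}(\pi) = \bigcup_{n \in \mathbb{N}} \llbracket p^n \rrbracket_{NK}(\pi) & \llbracket p + q \rrbracket_{NK}(\pi) = \llbracket p \rrbracket_{NK}(\pi) \cup \llbracket q \rrbracket_{NK}(\pi)
\end{array}
\]

Theorem 1. Take $\pi \in \mathsf{Pk}$ and a NetKAT term $p$. Then $\llbracket p \rrbracket_{NK}(\pi) = \bigcup_{1 \cdot b \in \llbracket p \rrbracket(\{\pi\})} b$.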


We can derive a further relation between the semantics if we assume there is 
no use of + and * (the proof uses Lemma 3). 


Lemma 5. Let $p$ be built out of packet predicates and modifications ($f \leftarrow n$), and
their sequential composition. Then $\llbracket p \rrbracket(a) = \{1 \cdot \bigcup_{\pi \in a} \llbracket p \rrbracket_{NK}(\pi)\}$.
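The relation stated in Lemma 5 can be spot-checked on a small interpreter for the fragment without $+$ and $*$. This is a sketch under assumptions: packets are frozensets of (field, value) pairs, terms are nested tuples, and the set-level function is assumed to apply a modification to every packet of its input and to return only the output set $b$ of the single pair $1 \cdot b$. The names `netkat`, `setlevel`, `pk` and `set_field` are hypothetical.

```python
def pk(**fields):
    """Build a packet as a frozenset of (field, value) pairs."""
    return frozenset(fields.items())


def set_field(packet, field, value):
    """Return a copy of the packet with one field overwritten."""
    updated = dict(packet)
    updated[field] = value
    return frozenset(updated.items())


def netkat(p, packet):
    """Per-packet semantics: the set of packets produced from one input packet."""
    tag = p[0]
    if tag == "pass":
        return {packet}
    if tag == "drop":
        return set()
    if tag == "test":
        return {packet} if dict(packet).get(p[1]) == p[2] else set()
    if tag == "mod":
        return {set_field(packet, p[1], p[2])}
    if tag == "seq":
        return {out for mid in netkat(p[1], packet) for out in netkat(p[2], mid)}
    raise ValueError(f"unsupported term: {tag}")


def setlevel(p, a):
    """Set-level semantics of the same fragment: the output set b of the pair 1 . b."""
    tag = p[0]
    if tag == "pass":
        return frozenset(a)
    if tag == "drop":
        return frozenset()
    if tag == "test":
        return frozenset(x for x in a if dict(x).get(p[1]) == p[2])
    if tag == "mod":
        return frozenset(set_field(x, p[1], p[2]) for x in a)
    if tag == "seq":
        return setlevel(p[2], setlevel(p[1], a))
    raise ValueError(f"unsupported term: {tag}")


prog = ("seq", ("test", "sw", 1), ("mod", "sw", 3))
inputs = frozenset({pk(sw=1), pk(sw=2)})
pointwise = frozenset().union(*(netkat(prog, x) for x in inputs))
print(setlevel(prog, inputs) == pointwise)
```

The comparison only exercises one program and one input; it illustrates the statement and is not a substitute for the proof.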


It is worth remarking that the equational theories of NetKAT and CNetKAT
are not equivalent: there are equivalent programs in NetKAT that cannot be
proved equivalent with the CNetKAT axioms, as the following example illustrates.
Consider the program $p + \mathsf{drop}$ for $p$ a packet program without parallel. In
NetKAT, because the $+$ is interpreted as multicast, this program is provably
equivalent to $p$: executing $p$ on your input packet while at the same time also
dropping a copy of the input has the same outcome as just executing $p$. In
CNetKAT, however, this is not the case. Instead, the $+$-operator is interpreted
as non-deterministic choice and in the semantics of $p + \mathsf{drop}$ we get the trace
$1 \cdot \emptyset$, representing the choice of dropping all the packets, which is not present
in the semantics of $p$. Hence, this axiom is unsound ($p + \mathsf{drop} \not\equiv p$), and instead
the alternative axiom $p \parallel \mathsf{drop} \equiv p$ holds, reflecting the fact that $\parallel$ is multicast.
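The contrast between choice and multicast can be reproduced on packet programs alone. The sketch below makes two assumptions that are not spelled out in this section: a packet program is modeled as a function from an input set to the set of possible output sets (the pomset component is always empty and is omitted), and the parallel of two packet programs is taken to union their outputs. All function names are hypothetical.

```python
def pk(**fields):
    """Build a packet as a frozenset of (field, value) pairs."""
    return frozenset(fields.items())


def set_field(packet, field, value):
    """Return a copy of the packet with one field overwritten."""
    updated = dict(packet)
    updated[field] = value
    return frozenset(updated.items())


def drop(a):
    """Forget all packets: the only possible output set is empty."""
    return {frozenset()}


def mod(field, value):
    """Modification applied to every packet of the input."""
    return lambda a: {frozenset(set_field(x, field, value) for x in a)}


def plus(p, q):
    """Non-deterministic choice: either the outputs of p or those of q."""
    return lambda a: p(a) | q(a)


def par(p, q):
    """Multicast (assumed reading of parallel): union one output of p with one of q."""
    return lambda a: {b | c for b in p(a) for c in q(a)}


p = mod("sw", 3)
inputs = frozenset({pk(sw=1)})
print(plus(p, drop)(inputs) == p(inputs))   # choice with drop vs. p
print(par(p, drop)(inputs) == p(inputs))    # parallel with drop vs. p
```

In this model the choice adds the extra outcome in which every packet is lost, whereas the parallel with drop contributes nothing to the output set.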

We now show that the CNetKAT semantics is equivalent to the POCKA semantics on
state programs. In [37], POCKA terms are what we defined as state programs
over the alphabet $O \cup \mathsf{Act}$, and they are interpreted in terms of pomset languages
over assignments and states, encoded as partial functions, similarly to separation
logic [33]. The POCKA semantics is defined in two steps: the first step results
in a set containing all pomsets that can be derived directly from the terms, and
in a second step this set is closed under two laws, exch and contr, that account
for all traces that can be built in parallel threads (including simple interleaving).


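Definition 5 (POCKA semantics). Let $o \in O$, $e \in \mathsf{Act}$, $p, q \in \mathcal{T}_{\mathrm{state}}(O \cup \mathsf{Act})$.

\[
\begin{array}{llll}
\llparenthesis o \rrparenthesis = \mathsf{St}^* \cdot \llbracket o \rrbracket_O \cdot \mathsf{St}^* & \llparenthesis p ; q \rrparenthesis = \llparenthesis p \rrparenthesis \cdot \llparenthesis q \rrparenthesis & \llparenthesis \mathsf{skip} \rrparenthesis = \{1\} & \llparenthesis \mathsf{abort} \rrparenthesis = \emptyset \\[4pt]
\llparenthesis e \rrparenthesis = \mathsf{St}^* \cdot \{e\} \cdot \mathsf{St}^* & \llparenthesis p \parallel q \rrparenthesis = \llparenthesis p \rrparenthesis \parallel \llparenthesis q \rrparenthesis & \llparenthesis p^* \rrparenthesis = \llparenthesis p \rrparenthesis^* & \llparenthesis p + q \rrparenthesis = \llparenthesis p \rrparenthesis \cup \llparenthesis q \rrparenthesis
\end{array}
\]

The semantics of a POCKA expression $p$ is $\llbracket p \rrbracket_{\mathrm{POCKA}} = \llparenthesis p \rrparenthesis{\downarrow}$.


Theorem 2. CNetKAT is a conservative extension of POCKA: if $p$ is a POCKA
term ($p \in \mathcal{T}_{\mathrm{state}}(O \cup \mathsf{Act})$) then for $a \neq \emptyset$, $\llbracket p \rrbracket{\downarrow}(a) = \{u \cdot a \mid u \in \llbracket p \rrbracket_{\mathrm{POCKA}}\}$.

3.4 Axiomatization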


We introduce notation to describe packets and sets of packets axiomatically. Let
$f_1, \ldots, f_k$ be a list of all fields of a packet in some fixed order. Then for each tuple
$\bar{n} = n_1, \ldots, n_k$ we obtain expressions $f_1{=}n_1 ; \cdots ; f_k{=}n_k$ and $f_1{\leftarrow}n_1 ; \cdots ; f_k{\leftarrow}n_k$,
which, similar to NetKAT, we call complete tests and complete assignments.
Complete tests are also referred to as atoms, because they are the atoms of
the Boolean algebra generated by the tests. We denote the set of atoms by $\mathsf{At}$,
complete tests with $\alpha$ and complete assignments with $\pi$. There is a one-to-one
correspondence between complete tests and assignments according to the values
of $\bar{n}$. For $\alpha \in \mathsf{At}$ we denote the corresponding complete assignment by $\pi_\alpha$, and
if $\pi$ is a complete assignment we denote the corresponding atom by $\alpha_\pi$.

There is also a link between sets of packets and terms of the form $\parallel_{i \in I} \pi_i$. For
each set of packets $a$, we take the set $\{\pi_i \mid i \in I\}$ of complete assignments such
that each $\pi_i$ corresponds to a packet of $a$, and combine them in parallel. Formally,
for a set of packets $a$ there exists an expression $\parallel_{i \in I} \pi_i$, which we denote by $\Pi_a$,
such that on any input $b \neq \emptyset$, $\llbracket \Pi_a \rrbracket(b) = \{1 \cdot a\}$. Similarly, the semantics of an
expression of the form $\parallel_{i \in I} \pi_i$ on any input is always $\{1 \cdot a\}$ for some $a \in 2^{\mathsf{Pk}}$.
We use the notation $\Pi_a$ as a syntactic representation of the set of packets $a$.

CNetKAT has the structure of a Kleene algebra on state programs, enriched 
with additional axioms. Tests form a Boolean algebra and state observations 
a pseudocomplemented distributive lattice (PCDL). The test and observation 
structures are subject to interaction constraints. The packet processing behavior 
is captured by the packet axioms, which contain axioms for individual packets 
and sets of packets. The axioms governing the parallel operator are partially 
familiar from earlier work on BKA [13,25]. There is also the exchange law familiar 
from CKA. Lastly, we have axioms for the interactions between state programs 
and packet programs. The full set of axioms is described in Figure 4. We write
$\equiv$ for the smallest congruence on $\mathsf{Prg}$ generated by the axioms in Figure 4.


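Remark 9 (When is $\Pi_a$ equal to drop?). $\Pi_a \equiv \mathsf{drop}$ if and only if $a$ is empty:
$\Pi_\emptyset = \parallel_{i \in \emptyset} \pi_i = \parallel \emptyset = \bigvee \emptyset = \mathsf{drop}$. For all other $a$, we have $\Pi_a \not\equiv \mathsf{drop}$.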


There are a few subtleties to notice in Figure 4. First, we point out the 
interaction between drop and abort. When no packets are present, not even abort 
can be executed. Hence, if we drop all packets and then abort, the abort does 
not happen: drop ; abort = drop. On the other hand, if we first abort and then 
drop all the packets, the behavior is equal to just aborting: abort ; drop = abort. 

In the axioms of the parallel operator, the axiom $s \parallel \mathsf{skip} \equiv s$ from BKA is
missing; it only holds when $s$ is a state program, and can be found in the local
state vs global state axioms. In addition to the familiar BKA axioms, there is
the axiom $\mathsf{drop} \parallel p \equiv p$, in contrast with $\mathsf{abort} \parallel p \equiv \mathsf{abort}$.

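The local state vs global state axioms capture the interactions between the
global pomset and the output packets. The first one, $\Pi_a ; \mathsf{dup} \equiv \Pi_a ; a$, captures
the intuition that if we know the input is $a$ (due to $\Pi_a$, which, as a parallel of
complete assignments, essentially overwrites any non-empty input set to $a$), then
we know the dup is recording an “$a$”. The second axiom, $\Pi_a ; w \equiv w ; \Pi_a$, states
that for a dup-free state program $w$, we can flip the order between changing the
set of output packets and performing the state changes in $w$, as long as $\Pi_a$ is not
the parallel representing the empty set. This latter condition is crucial: if $a = \emptyset$,
then $\Pi_a \equiv \mathsf{drop}$, and $\mathsf{drop} ; w \equiv \mathsf{drop}$ (the global state changes in $w$ do not get
executed if we have no packets).

Kleene Algebra axioms ($s \in \mathcal{T}_{\mathrm{state}}$)
\[
\begin{array}{l}
p + (q + r) \equiv (p + q) + r \\
p + q \equiv q + p \\
p + \mathsf{abort} \equiv p \\
p + p \equiv p \\
p ; (q ; r) \equiv (p ; q) ; r \\
s ; \mathsf{abort} \equiv \mathsf{abort} \\
\mathsf{abort} ; p \equiv \mathsf{abort} \\
p ; \mathsf{skip} \equiv p \equiv \mathsf{skip} ; p \\
p ; (q + r) \equiv p ; q + p ; r \\
(p + q) ; r \equiv p ; r + q ; r \\
p^* \equiv \mathsf{skip} + p ; p^* \\
p + q ; r \leq q \implies p ; r^* \leq q \\
p^* \equiv \mathsf{skip} + p^* ; p \\
p + q ; r \leq r \implies q^* ; p \leq r
\end{array}
\]

Packet axioms ($x \in \mathcal{T}_{\mathrm{det\text{-}pack}}$)
\[
\begin{array}{ll}
f \leftarrow n ; f' \leftarrow m \equiv f' \leftarrow m ; f \leftarrow n & (f \neq f') \\
f \leftarrow n ; f' = m \equiv f' = m ; f \leftarrow n & (f \neq f') \\
f = n ; f \leftarrow n \equiv f = n \\
f \leftarrow n ; f = n \equiv f \leftarrow n \\
f \leftarrow m ; f \leftarrow n \equiv f \leftarrow n \\
x \parallel x \equiv x \\
x ; (p \parallel q) \equiv (x ; p) \parallel (x ; q) \\
(p \parallel q) ; x \equiv (p ; x) \parallel (q ; x)
\end{array}
\]

Local vs global state ($y, z \in \mathcal{T}_{\mathrm{packet}}$; $s, v \in \mathcal{T}_{\mathrm{state}}$; $w \in \mathcal{T}_{\mathrm{state}}(O \cup \mathsf{Act} \cup 2^{\mathsf{Pk}})$)
\[
\begin{array}{ll}
\Pi_a ; \mathsf{dup} \equiv \Pi_a ; a & (a \in 2^{\mathsf{Pk}}) \\
\Pi_a ; w \equiv w ; \Pi_a & (a \in 2^{\mathsf{Pk}}_{\neq\emptyset}) \\
\mathsf{drop} ; p \equiv \mathsf{drop} \qquad y ; \mathsf{drop} \equiv \mathsf{drop} \\
s \parallel \mathsf{skip} \equiv s \\
(s ; y) \parallel (v ; z) \equiv (s \parallel v) ; (y \parallel z)
\end{array}
\]

Extensionality
\[
\big(\forall a \in 2^{\mathsf{Pk}}.\ \Pi_a ; p \equiv \Pi_a ; q\big) \implies p \equiv q
\]

Parallel axioms
\[
\begin{array}{l}
p \parallel (q \parallel r) \equiv (p \parallel q) \parallel r \\
p \parallel \mathsf{abort} \equiv \mathsf{abort} \\
\mathsf{drop} \parallel p \equiv p \\
p \parallel (q + r) \equiv p \parallel q + p \parallel r \\
p \parallel q \equiv q \parallel p
\end{array}
\]

Exchange law ($s, s', v, v' \in \mathcal{T}_{\mathrm{state}}$)
\[
(s \parallel s') ; (v \parallel v') \leq (s ; v) \parallel (s' ; v')
\]

Packet pred., state obs. axioms ($\vee \in \{\vee, \vee_B\}$, $\wedge \in \{\wedge, \wedge_B\}$, $a, b, c \in B \cup O$)
\[
\begin{array}{l}
a \wedge b \equiv b \wedge a \\
a \wedge (b \wedge c) \equiv (a \wedge b) \wedge c \\
a \vee (a \wedge b) \equiv a \equiv a \wedge (a \vee b) \\
a \vee (b \wedge c) \equiv (a \vee b) \wedge (a \vee c) \\
a \wedge (b \vee c) \equiv (a \wedge b) \vee (a \wedge c)
\end{array}
\]

Additional state obs. axioms
\[
\begin{array}{ll}
o \equiv o \wedge \top \\
o \leq \overline{o'} \iff o \wedge o' \equiv \bot \\
v{=}n \wedge v{=}m \equiv \bot & (n \neq m) \\
\overline{v{=}n} \leq \bigvee_{n \neq m} v{=}m \\
\overline{\bigwedge_i v_i{=}n_i} \leq \bigvee_i \overline{v_i{=}n_i} & (\forall i \neq j.\ v_i \neq v_j)
\end{array}
\]

Additional packet pred. axioms
\[
\begin{array}{ll}
t \vee_B \mathsf{pass} \equiv \mathsf{pass} \equiv t \vee_B \neg t \\
f{=}n \wedge_B f{=}m \equiv \mathsf{drop} & (n \neq m)
\end{array}
\]

Interface axioms
\[
\begin{array}{lll}
o \wedge o' \leq o ; o' & o \vee o' \equiv o + o' & (o, o' \in O) \\
\mathsf{abort} \equiv \bot & \mathsf{skip} \equiv \mathsf{pass} \\
\top ; o \leq o & o ; \top \leq o \\
\top ; e \leq e & e ; \top \leq e & (e \in \mathsf{Act}) \\
t \wedge_B t' \equiv t ; t' & t \vee_B t' \equiv t \parallel t' & (t, t' \in B)
\end{array}
\]

Fig. 4: Axioms of CNetKAT. The left column contains the KA axioms, the packet
axioms, the axioms for the interaction between the local and global state, and an
extensionality axiom. The right column axiomatizes the $\parallel$, the algebra of packet
tests (which is a Boolean algebra), and the algebra of partial state observations
(which is a PCDL). The interface axioms connect both the lattice operators to
the Kleene algebra ones. We write $e \leq f$ as a shorthand for $e + f \equiv f$.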

The axiom $\mathsf{drop} ; p \equiv \mathsf{drop}$ for any program $p$ captures the intuition that if
there are no packets, nothing happens anymore. The other way around, $y ; \mathsf{drop} \equiv \mathsf{drop}$
is only true for $y$ a packet program; if $y$ were a state program, the global
state changes would get executed whenever we start with a non-empty set of input packets,
making the behavior of $y ; \mathsf{drop}$ not equivalent to drop.

Lastly, extensionality says that if two programs are equivalent on all inputs
(i.e., on all $a \in 2^{\mathsf{Pk}}$), then the programs are equivalent. It is not clear whether this
axiom is derivable from the others; we hope to settle this question in the future.


4 Soundness and Completeness 


In this section we prove soundness and completeness of the CNetKAT semantics 
w.r.t. the axiomatization from Figure 4. For soundness, we prove that if programs 
p and q are provably equivalent using the axioms, they have the same semantics: 


Theorem 3 (Soundness). For all $p, q \in \mathsf{Prg}$, if $p \equiv q$, then $\llbracket p \rrbracket{\downarrow} = \llbracket q \rrbracket{\downarrow}$.


Conversely, we will prove that if $p$ and $q$ have the same semantics on all
inputs $a$, then $p \equiv q$. We structure the completeness proof in four parts:


1. Define a normal form for CNetKAT programs, and show that for every input 
set a, every program is provably equivalent to a program in normal form 
in which a is incorporated. In other words, the normal form of a program 
is dependent on the input. Similar to NetKAT, normal form programs are 
CNetKAT expressions over complete assignments. We show that we have a 
simplified set of axioms on complete assignments and tests. 

2. Obtain completeness for $\Pi_a$-shaped programs from NetKAT completeness.

3. Using completeness of POCKA, obtain completeness for programs of the form
$s ; \Pi_a$ (and sums thereof), where $s$ is a state program.

4. Lastly, we combine these results to prove that if $p$ and $q$ have the same
behavior on input $a$, the program $\Pi_a ; p$ is provably equivalent to $\Pi_a ; q$.

Step 1: Normal form. We prove that for every $a \in 2^{\mathsf{Pk}}$, we can write any
program $p$ as $\Pi_a$ followed by a sum of state programs, each followed by a parallel of
complete assignments. This is the most difficult step in the completeness proof.

We derive a few equivalences from Figure 4 regarding complete tests and
assignments that make the proof of the normal form easier. We refer to these
equivalences as the reduced axioms. For $\alpha$ and $\beta$ complete tests such that $\alpha \neq \beta$, $\pi$
and $\pi'$ complete assignments, and $a \in 2^{\mathsf{Pk}}_{\neq\emptyset}$, $b \in 2^{\mathsf{Pk}}$, we can derive:

\[
\pi \equiv \pi ; \alpha_\pi \qquad \alpha \equiv \alpha ; \pi_\alpha \qquad \pi ; \pi' \equiv \pi' \qquad \alpha ; \beta \equiv \mathsf{drop} \qquad \Pi_a ; \Pi_b \equiv \Pi_b
\]


All of these equivalences are easy consequences of the packet axioms, the
packet predicate axioms, the axiom $t \wedge_B t' \equiv t ; t'$ and the fact that for all packet
programs $p$ we have $p ; \mathsf{drop} \equiv \mathsf{drop} \equiv \mathsf{drop} ; p$ [3]. The last reduced axiom is
derived in the full version of this article [38, Lemma 14].

Theorem 4 (Normal form). Let $p \in \mathsf{Prg}$ and $a \in 2^{\mathsf{Pk}}$. There exists a finite
set $J$, and elements $u_j \in \mathcal{T}_{\mathrm{state}}(O \cup \mathsf{Act} \cup 2^{\mathsf{Pk}})$ and $b_j \in 2^{\mathsf{Pk}}$ for each $j \in J$ s.t.

\[
\Pi_a ; p \equiv \Pi_a ; \sum_{j \in J} (u_j ; \Pi_{b_j})
\]

Sketch. The proof proceeds by induction on the structure of $p$. For instance, for
an assignment $f \leftarrow n$, where we take $\Pi_a = \parallel_{k \in K} \pi_k$ for some non-empty finite
index set $K$ and complete assignments $\pi_k$, we derive

\[
\begin{aligned}
\Pi_a ; f \leftarrow n &\equiv \Pi_a ; \Pi_a ; f \leftarrow n && (\Pi_a ; \Pi_b \equiv \Pi_b) \\
&\equiv \Pi_a ; \Big(\parallel_{k \in K} \pi_k\Big) ; f \leftarrow n \\
&\equiv \Pi_a ; \parallel_{k \in K} (\pi_k ; f \leftarrow n) && ((p \parallel q) ; x \equiv (p ; x) \parallel (q ; x)) \\
&\equiv \Pi_a ; \mathsf{skip} ; \parallel_{k \in K} \pi'_k && (p ; \mathsf{skip} \equiv p)
\end{aligned}
\]

where $\pi'_k$ is $\pi_k$ with the assignment for $f$ replaced by $f \leftarrow n$. If $K = \emptyset$ then
$\Pi_a \equiv \mathsf{drop}$ and the equivalence above follows immediately. The most difficult
case is the star; we use an argument that relies on the fact that matrices over
a Kleene algebra form a Kleene algebra [20]. A proof can be found in the full
version of this article [38, Appendix D].
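As an illustration of the assignment case, the following worked instance is a sketch under assumed data: two hypothetical fields $sw$ and $dst$, and an input $a$ consisting of two packets that differ only in $sw$. None of these concrete values come from the paper; the steps use only the packet axioms and the unit axiom from Figure 4.

```latex
% Assumed complete assignments over the hypothetical fields sw and dst:
%   \pi_1 = sw \leftarrow 1 ; dst \leftarrow 7,   \pi_2 = sw \leftarrow 2 ; dst \leftarrow 7,   \Pi_a = \pi_1 \parallel \pi_2
\begin{align*}
\pi_1 ; sw \leftarrow 4
  &\equiv sw \leftarrow 1 ; sw \leftarrow 4 ; dst \leftarrow 7
    && (f \leftarrow n ; f' \leftarrow m \equiv f' \leftarrow m ; f \leftarrow n,\ f \neq f') \\
  &\equiv sw \leftarrow 4 ; dst \leftarrow 7
    && (f \leftarrow m ; f \leftarrow n \equiv f \leftarrow n) \\[4pt]
\Pi_a ; sw \leftarrow 4
  &\equiv \Pi_a ; (\pi_1 \parallel \pi_2) ; sw \leftarrow 4
    && (\Pi_a ; \Pi_b \equiv \Pi_b) \\
  &\equiv \Pi_a ; \big((\pi_1 ; sw \leftarrow 4) \parallel (\pi_2 ; sw \leftarrow 4)\big)
    && ((p \parallel q) ; x \equiv (p ; x) \parallel (q ; x)) \\
  &\equiv \Pi_a ; \big((sw \leftarrow 4 ; dst \leftarrow 7) \parallel (sw \leftarrow 4 ; dst \leftarrow 7)\big)
    && \text{(first derivation, applied to } \pi_1 \text{ and } \pi_2\text{)} \\
  &\equiv \Pi_a ; \mathsf{skip} ; (sw \leftarrow 4 ; dst \leftarrow 7)
    && (x \parallel x \equiv x,\ p ; \mathsf{skip} \equiv p)
\end{align*}
```

In this instance the index set $J$ of the normal form is a singleton, with $u_j = \mathsf{skip}$ and $b_j$ the one-packet set whose packet has $sw = 4$ and $dst = 7$: the two input packets collapse into a single output packet.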


Step 2: Completeness for $\Pi_a$-shaped programs. As mentioned, $\Pi_a$-shaped
programs are syntactic representations of packet sets. We prove that if
two such programs result in the same set of packets on any non-empty input,
they are provably equivalent, using that $\Pi_a$ describes a unique set of packets.


Lemma 6. Let $a \in 2^{\mathsf{Pk}}_{\neq\emptyset}$ and $b, c \in 2^{\mathsf{Pk}}$. If $\llbracket \Pi_b \rrbracket{\downarrow}(a) = \llbracket \Pi_c \rrbracket{\downarrow}(a)$ then $\Pi_b \equiv \Pi_c$.


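Step 3: Completeness of sums in the normal form. We first prove
completeness for state programs, where we use completeness of POCKA. To do
so, some caution is needed; POCKA terms are state terms over the alphabet
$O \cup \mathsf{Act}$. However, the state terms relevant here also include elements $a \in 2^{\mathsf{Pk}}$.


Lemma 7. Let $s, v \in \mathcal{T}_{\mathrm{state}}(O \cup \mathsf{Act} \cup 2^{\mathsf{Pk}})$ and $a \in 2^{\mathsf{Pk}}_{\neq\emptyset}$. If $\llbracket s \rrbracket{\downarrow}(a) = \llbracket v \rrbracket{\downarrow}(a)$,
then $s \equiv v$.


Next we prove completeness for expressions of the form $s ; \Pi_a$, and then
extend this to arbitrary finite sums of such programs:


Lemma 8. Let $b, c \in 2^{\mathsf{Pk}}$, $u, v$ state programs, and $a \in 2^{\mathsf{Pk}}_{\neq\emptyset}$. Then we have:
$\llbracket u ; \Pi_b \rrbracket{\downarrow}(a) = \llbracket v ; \Pi_c \rrbracket{\downarrow}(a) \implies u ; \Pi_b \equiv v ; \Pi_c$.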


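Lemma 9. If $\llbracket \sum_{j \in J} (u_j ; \Pi_{b_j}) \rrbracket{\downarrow}(a) = \llbracket \sum_{k \in K} (v_k ; \Pi_{c_k}) \rrbracket{\downarrow}(a)$ for some $a \in 2^{\mathsf{Pk}}_{\neq\emptyset}$,
then $\sum_{j \in J} (u_j ; \Pi_{b_j}) \equiv \sum_{k \in K} (v_k ; \Pi_{c_k})$, where $J, K$ are finite; $u_j, v_k$ are
state programs and $b_j, c_k \in 2^{\mathsf{Pk}}$ for each $j, k$.

Step 4: Completeness. The last lemma before proving completeness relates
the semantics of $p$ on input $a$ to the semantics of $\Pi_a ; p$ on any non-empty input.


Lemma 10. Let $b \in 2^{\mathsf{Pk}}_{\neq\emptyset}$ and $a \in 2^{\mathsf{Pk}}$. For all $p \in \mathsf{Prg}$, $\llbracket \Pi_a ; p \rrbracket{\downarrow}(b) = \llbracket p \rrbracket{\downarrow}(a)$.


Theorem 5 (Completeness). Let $p, q \in \mathsf{Prg}$. If $\llbracket p \rrbracket{\downarrow}(a) = \llbracket q \rrbracket{\downarrow}(a)$ for all
$a \in 2^{\mathsf{Pk}}$, then $p \equiv q$.


Proof. We first show that $\Pi_a ; p \equiv \Pi_a ; q$ for all $a \in 2^{\mathsf{Pk}}$. In case $a = \emptyset$, $\Pi_a$ must
be the empty parallel. Hence, $\Pi_a ; p \equiv \mathsf{drop} \equiv \Pi_a ; q$. In the rest of the proof
we assume $a \neq \emptyset$. Via Lemma 10, we obtain that $\llbracket p \rrbracket{\downarrow}(a) = \llbracket \Pi_a ; p \rrbracket{\downarrow}(a) =
\llbracket \Pi_a ; q \rrbracket{\downarrow}(a) = \llbracket q \rrbracket{\downarrow}(a)$. We obtain a normal form such that $\Pi_a ; p \equiv \Pi_a ;
\sum_{j \in J} (u_j ; \Pi_{b_j})$ (Theorem 4). Similarly, $\Pi_a ; q \equiv \Pi_a ; \sum_{k \in K} (v_k ; \Pi_{c_k})$. Via soundness
we derive $\llbracket \Pi_a ; \sum_{j \in J} (u_j ; \Pi_{b_j}) \rrbracket{\downarrow}(a) = \llbracket \Pi_a ; \sum_{k \in K} (v_k ; \Pi_{c_k}) \rrbracket{\downarrow}(a)$, and
via Lemma 10 that $\llbracket \sum_{j \in J} (u_j ; \Pi_{b_j}) \rrbracket{\downarrow}(a) = \llbracket \sum_{k \in K} (v_k ; \Pi_{c_k}) \rrbracket{\downarrow}(a)$. With the
partial completeness result from Lemma 9, we obtain that $\sum_{j \in J} (u_j ; \Pi_{b_j}) \equiv
\sum_{k \in K} (v_k ; \Pi_{c_k})$. This leads to

\[
\Pi_a ; p \equiv \Pi_a ; \sum_{j \in J} (u_j ; \Pi_{b_j}) \equiv \Pi_a ; \sum_{k \in K} (v_k ; \Pi_{c_k}) \equiv \Pi_a ; q
\]

Hence, we have derived that $\Pi_a ; p \equiv \Pi_a ; q$ for all $a \in 2^{\mathsf{Pk}}$. With the
extensionality axiom we can conclude that $p \equiv q$.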


5 Examples 


This section shows how we can use CNetKAT to model and analyze several 
concurrent programs. We start by analyzing the running example from §2, and 
then proceed to a more involved example that combines the behavior of a stateful 
firewall, a load balancer, and an in-network cache. 


5.1 Running Example 


Consider again the running example from §2. Because we are ultimately inter- 
ested in the behavior of the program when the packets have reached their final 
destination, switch 4, we will add a test sw=4 at the end of the program: 


\[ p = (v \leftarrow 0) \,;\, (p_1 \parallel p_2 \parallel p_3 \parallel p_4)^{*} \,;\, (sw = 4) \]


Recall that the CNetKAT semantics of a program contains traces that are only 
required to model executions where the program is composed in parallel with 
another program, to ensure a compositional semantics for the language. However, 
to analyze the behavior of a program in isolation, we want to eliminate these 
extra traces. To do this, we follow the same strategy used in [37], where so-called 
guarded pomsets were proposed. Guarded pomsets are a subclass of pomsets
that captures the characteristics of behaviors of (concurrent) programs running 
in isolation. For example, in a guarded pomset, if one assertion, say v=0, occurs 
before another assertion, say v=1, there must be an assignment v←1 between
the two asserts to account for the change. That is, in an isolated execution every 
change to variables must be explained by an action in the program. 
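
The guardedness condition can be illustrated with a small Python sketch. It is a deliberate simplification: guarded pomsets are partial orders, whereas the sketch below only handles totally ordered traces, and the event encoding (`("assert", var, val)` and `("assign", var, val)`) as well as the function name `is_guarded` are hypothetical choices made for illustration, not part of CNetKAT.

```python
from typing import Dict, List, Tuple

# (kind, variable, value); kind is "assert" or "assign" (hypothetical encoding)
Event = Tuple[str, str, int]


def is_guarded(trace: List[Event]) -> bool:
    """Return True if every assertion agrees with the last known value of its variable."""
    known: Dict[str, int] = {}
    for kind, var, val in trace:
        if kind == "assign":
            known[var] = val  # an assignment explains any later change
        elif kind == "assert":
            if var in known and known[var] != val:
                return False  # the value changed without an explaining assignment
            known[var] = val  # the first observation fixes the value
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return True


# v=0 followed directly by v=1: no assignment accounts for the change.
unguarded = [("assert", "v", 0), ("assert", "v", 1)]
# v=0, then v<-1, then v=1: the assignment accounts for the change.
guarded = [("assert", "v", 0), ("assign", "v", 1), ("assert", "v", 1)]

print(is_guarded(unguarded))  # False
print(is_guarded(guarded))    # True
```

The second trace mirrors the situation described above: the assertion v=1 is acceptable only because the assignment v←1 sits between the two assertions.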

To illustrate the difference between pomsets and guarded pomsets, consider
our example. We unfold the Kleene star twice and evaluate the resulting program;
we obtain a pair with output {@[4/sw], Q[4/sw]} and corresponding pomset,

[Pomset diagram: $(v \leftarrow 0)$ precedes the node {Q, @}, which forks into two parallel chains. The upper chain starts with the observation $\beta$ and carries packet @ via switch 2 to {@[4/sw]}; the lower chain carries packet Q via switch 3, where $(v \leftarrow 1)$ occurs, to {Q[4/sw]}.]

where $\beta(v) = 1$. This pomset is unguarded: $\beta(v) = 1$ occurs without a cause.
The semantics also contains a pair with {@[4/sw], Q[4/sw]} and pomset,

[Pomset diagram: as before, but starting with $\gamma \to (v \leftarrow 0) \to \alpha$, and with an additional arrow from $(v \leftarrow 1)$ in the lower chain to $\beta$ in the upper chain.]

with $\alpha(v) = 0$, $\beta(v) = 1$, and $\gamma$ unrestricted. This pomset is guarded because it
contains an arrow from $v \leftarrow 1$ to $\beta$, justifying the change in valuation from $\alpha$ to
$\beta$. As we show in the full version of this article [38, Appendix E], all guarded
pomsets in the semantics will have this arrow, and satisfy the desired property:
Q packets are observed at switch 3 before @ packets are observed at switch 2.

Now consider the axiomatic claim we made in §2 (i.e., (2)), $(Q \parallel @) ; q \leq (Q \parallel
@) ; p$ where $q$ is the program from (1). We can easily see that the following holds:
$\llbracket q \rrbracket{\downarrow}(\{Q, @\}) \subseteq \llbracket p \rrbracket{\downarrow}(\{Q, @\})$. Hence, we can use Lemma 10 and the completeness
result for CNetKAT (Theorem 5) to obtain (2).


5.2 Stateful Load Balancer, Cache, and Firewall 


For a more complex example, consider the network in Figure 5, which is adapted 
from an example from [2]. The overall goal is to (i) prevent packets from a
high-priority server $s_h$ going to low-priority hosts $l_1, \ldots, l_n$ and (ii) load balance
requests to the servers in a round-robin fashion. We provide naive specifications
for the cache, firewall and load balancer programs in Figure 5. For simplicity, we 
assume that there is exactly one low-priority host, and exactly one high-priority 
host, i.e., n = k = 1, and we leave the specification of the topology implicit. 


Remark 10. In contrast with the previous example, the program in Figure 5 
includes reads and writes of a global variable that occur on different physical 
devices. In principle, synchronizing variables like r would give rise to additional 
packets that update local copies of variables—a process that could itself be 
modelled in CNetKAT. We leave the implementation of a translation pass that 
achieves the synchronization of global variables across switches to future work. 
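\[
\begin{aligned}
C &\triangleq ((v = 1) ; (\mathit{dst} \leftarrow h_1) ; \mathit{dup} + (v = 0) ; (\mathit{dst} \leftarrow l_1) ; \mathit{dup}) \\
  &\quad \parallel (\mathit{src} = l_1 ; (\mathit{dst} \leftarrow \mathit{firewall})) \parallel (\mathit{src} = h_1 ; (\mathit{dst} \leftarrow \mathit{firewall})) \\
F &\triangleq (\mathit{src} = s_l ; (v \leftarrow 0) ; (\mathit{dst} \leftarrow \mathit{cache})) \parallel (\mathit{src} = s_h ; (v \leftarrow 1) ; (\mathit{dst} \leftarrow \mathit{cache})) \\
  &\quad \parallel (\mathit{src} = l_1 ; (r \leftarrow 0) ; (\mathit{dst} \leftarrow \mathit{loadb})) \parallel (\mathit{src} = h_1 ; (r \leftarrow 1) ; (\mathit{dst} \leftarrow \mathit{loadb})) \\
L &\triangleq ((r = 1) ; (\mathit{dst} \leftarrow s_h) ; \mathit{dup} + (r = 0) ; (\mathit{dst} \leftarrow s_l) ; \mathit{dup}) \\
  &\quad \parallel \mathit{src} = s_h ; (\mathit{dst} \leftarrow \mathit{firewall}) \parallel \mathit{src} = s_l ; (\mathit{dst} \leftarrow \mathit{firewall})
\end{aligned}
\]

Fig. 5: Stateful firewall between high/low priority hosts and servers.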


In [2], the authors point out a problem with the example that arises because 
the cache has no means to enforce the security policy. One strategy for resolving 
this problem is to swap the placement of the firewall and the cache. Another is 
to distribute access control rules onto the cache as well as the firewall. However, 
there is also a second, more subtle issue: the load balancer uses the global variable 
r to decide to which server to forward requests. In the presence of multiple 
packets, another packet may arrive before the change to the global variable
occurs, allowing two (or more!) packets to be sent to the same server.

The issue with the load balancer can be observed in the following example.
Take as input packets @ and Q with @(src) = Q(src) = $l_1$. After being
processed at the cache, both packets arrive at the firewall. One of the pairs
in the semantics of the firewall $F$ is the following, with $\alpha$ unrestricted and
$\beta(r) = 0$: $(\alpha \to (r \leftarrow 0) \to \beta) \cdot \{Q[\mathit{loadb}/\mathit{dst}], @[\mathit{loadb}/\mathit{dst}]\}$. After processing
by the load balancer, both packets are sent to $s_l$ simultaneously. To illustrate
this event, we claim that there is a guarded pomset in the semantics of the
load balancer. Observe that in the semantics of $L$ we find the following pomset,
with $\alpha$ and $\beta$ from before (the second $\beta$ is the result of the $r = 0$ in $L$):
$\alpha \to (r \leftarrow 0) \to \beta \to \beta \to \{Q[s_l/\mathit{dst}], @[s_l/\mathit{dst}]\}$. Using closure under contraction,
we obtain a guarded pomset (the two $\beta$-nodes are merged into one) where both
packets appear at $s_l$ at the same time.

A final issue stems from the fact that the firewall implementation is flawed 
as written. Specifically, it uses a global variable to determine whether a packet 
should be forwarded on to a high priority host. Of course, if another packet 
arrives before the current one has been forwarded, the value of this variable 
might change, resulting in both packets being forwarded to a low priority host. 
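
The race described above can be made concrete by enumerating interleavings. The following Python sketch is not the CNetKAT semantics; it is a plain operational simulation under stated assumptions: each packet first lets the firewall write the global variable `v` and later lets the cache read `v` to pick a destination, and the firewall is assumed to write `v=1` for the high-priority server and `v=0` for the low-priority one. The names `WRITES` and `outcomes` are hypothetical.

```python
from itertools import permutations
from typing import Dict, Set, Tuple

# Assumption (illustration only): value written to v by the firewall,
# keyed by the source server of the packet.
WRITES: Dict[str, int] = {"s_h": 1, "s_l": 0}


def outcomes() -> Set[Tuple[Tuple[str, str], ...]]:
    """Enumerate all interleavings and collect (source, destination) pairs."""
    steps = [("s_h", "write"), ("s_h", "read"), ("s_l", "write"), ("s_l", "read")]
    results = set()
    for order in permutations(steps):
        # Keep only interleavings in which each packet writes before it reads.
        if any(order.index((p, "write")) > order.index((p, "read")) for p in WRITES):
            continue
        v = None
        dest: Dict[str, str] = {}
        for pkt, action in order:
            if action == "write":
                v = WRITES[pkt]  # firewall step: update the global variable
            else:
                dest[pkt] = "h1" if v == 1 else "l1"  # cache step: read v
        results.add(tuple(sorted(dest.items())))
    return results


all_outcomes = outcomes()
# Policy violation: the packet from the high-priority server reaches l1.
violations = [o for o in all_outcomes if dict(o)["s_h"] == "l1"]
print(sorted(all_outcomes))
print("policy violated in some interleaving:", bool(violations))
```

Three distinct outcomes arise: the intended one, one where both packets reach `h1`, and one where both reach `l1`. The last is the violation discussed in the text: whenever both writes happen before both reads, the second write decides the destination of both packets.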

Algorithm 1 Leader logic.

    Initialize State:
      instance := 0
    upon receiving pkt(msgtype, inst, rnd, vrnd, swid, value)
      match pkt.msgtype:
        case REQUEST:
          pkt.msgtype ← PHASE2A
          pkt.rnd ← 0
          pkt.inst ← instance
          instance := instance + 1
          multicast pkt to acceptors
        default:
          drop pkt

CNetKAT term:

    instance←0 ; (
      msgtype = REQUEST ;
      msgtype←PHASE2A ;
      rnd←0 ;
      inst←instance ;
      instance←instance + 1 ;
      dst←1 ∥ ⋯ ∥ dst←k
    ) + drop

Fig. 6: Leader logic from [9] and CNetKAT term, with k acceptors.


The issue with the firewall can be observed as follows. Take as input two
packets @ and Q with @(src) = $s_h$ and Q(src) = $s_l$. After processing by the load
balancer, both packets end up in the firewall. One of the pairs in the semantics
of the firewall is the following, with $\alpha(v) = 1$ and $\beta$ unrestricted:
$(\alpha \cdot (v \leftarrow 0) \parallel \beta \cdot (v \leftarrow 1)) \cdot \{Q[\mathit{cache}/\mathit{dst}], @[\mathit{cache}/\mathit{dst}]\}$. After processing by the cache, both packets
are sent to $h_1$ or $l_1$. To illustrate how the packets travel to e.g. $l_1$, we find the
following pomset in the semantics of $C$, with $\alpha, \beta$ from before and $\gamma(v) = 0$:

    α → (v←0) ↘
                γ → {Q[l1/dst], @[l1/dst]}
    β → (v←1) ↗

This pomset subsumes a guarded pomset. Hence, by exchange closure, we
find guarded pomsets in the behavior of $C$ where the packets both end up at $l_1$.
Overall, these examples show that CNetKAT can model subtle interactions 
between packets that arise in the presence of concurrency and state. Moreover, 
the axiomatic semantics can be used to prove (in)equivalences between programs. 


6 Related Work 


The core of CNetKAT is two extensions of Kleene Algebra: NetKAT [3,10], a net- 
working extension of Kleene algebra with tests, and POCKA [37], a concurrent 
extension of KA. NetKAT describes how single packets move through a network, 
whereas CNetKAT can handle multiple packets. POCKA was introduced to describe
concurrent interactions of global variables, whereas CNetKAT makes use
of this algebra to enable inter-packet communication. CNetKAT captures interactions
between local and global state, which none of the previous work does.

In the family of KA extensions, POCKA is closest to Concurrent Kleene al- 
gebra with Observations (CKAO) [15,16], which was proposed to integrate con- 
currency with conditionals such as if-statements and while-loops. Contrary to 
CKAO, which uses a Boolean algebra to axiomatize conditionals, POCKA uses a 
pseudocomplemented distributive lattice (PCDL) as the algebra for tests, which 
are referred to as observations to mark the difference. The idea to use a PCDL 
as the algebra for observations was first proposed in [14]. 

Our work fits within the CKA tradition, which gives a true concurrency se- 
mantics and is thereby distinct from bisimulation semantics typically considered 
in process algebras, such as CSP and CCS. Another distinction is that CNetKAT 
uses global state rather than message passing. 

Some recently published work has also extended NetKAT with constructs 
for modeling multi-packet behavior [7]. Here the goal is to model interactions 
between the control and data planes in dynamic updates. Parallel composition
is axiomatized with a left-merge operator and a communication-merge operator, 
and semantics is in terms of bisimilarity instead of traces. The examples largely 
focus on the table updates, not on the flow of packets through the network. 

The current paper deviates from earlier concurrent variations on NetKAT, 
such as Concurrent NetCore [35] and a stateful variant of NetKAT introduced 
in [31]. Both have a different algebraic structure than NetKAT. Concurrent Net- 
Core does not have Kleene star, and does not provide a denotational semantics, 
or axiomatization. Moreover, it does not handle multiple packets, the use of + 
in the language is multicast rather than non-determinism, and || is concurrent 
processing of disjoint fields of the same packet. Because of these restrictions, 
Concurrent NetCore is less suitable for specifying inter-packet concurrency.

The approach in [31] models interactions among multiple packets, but is 
accompanied by semantic correctness guarantees, rather than algebraic formal- 
izations as in CNetKAT. A recent PhD thesis [29] contains another version of 
stateful NetKAT, which assumes packet processing can always be serialized into 
a deterministic, global order. This assumption enables a simpler semantics and 
a decision procedure, though completeness is left as an open problem. Flow con- 
trol in [29] is handled in the style of Guarded Kleene Algebra with Tests [22,36], 
which means that programs and specifications must be deterministic. 

More broadly, there is a growing community doing research on network veri- 
fication tools. Early work such as HSA [18], Anteater [30], Veriflow [19], Atomic 
Predicates [39], etc. focused on stateless SDN data planes, while more recent work 
such as p4v [27] and VMN [32] supports richer models such as P4 and stateful 
middleboxes. These tools typically use analyses based on symbolic simulation 
or they encode verification tasks into first-order formulas that can be checked 
using SMT solvers. To the best of our knowledge, CNetKAT is the first algebraic 
framework to model network-wide, multi-packet interaction with mutable state. 


7 Discussion 


We proposed CNetKAT, an algebraic framework to reason about programs with 
both local and global state, in the presence of parallel threads and control-flow 
statements. We provided a denotational semantics and a complete axiomatiza- 
tion. We also provided examples of how the language can be used to reason about 
stateful network programs and different sources of concurrency in a network. 
As a result of the algebraic approach, the semantics of a program arises from 
the semantics of its parts. This clashes with the idea of observational equivalence 
when concurrency comes into play: some behaviors of a program can only be 
observed when executed concurrently with another program, and not in isolation. 
Hence it becomes necessary to include some elements in the semantics that do not 
immediately correspond to observable behavior. This implies that observational 
equivalence is not the right notion for axiomatising the semantics. However, 
using the greatest congruence contained in a notion of observational equivalence 
is interesting; this guided us in the development of our axiomatisation but it
remains to be shown that our axiomatisation is indeed the greatest congruence. 
CNetKAT relies on a classic approach to proving program correctness: develop
a framework that can model both specifications and implementations, and show that
equivalence is decidable. Past experience with NetKAT suggests that this ap- 
proach is usable, although CNetKAT lacks a procedure to check semantic equiva- 
lence, or at least membership of a given pomset. Devising an efficient procedure 
for this task is our immediate priority. The procedure will most likely rely on 
automata models such as fork automata [28] or Petri automata [6,5]. 
Ultimately, we would like to use CNetKAT to reason about stateful and dis- 
tributed P4 programs. A target case study is provided in [9], which implemented 
Lamport's Paxos algorithm in the forwarding plane. To show correctness, the
authors translated the implementation to Promela, a model checking language, and
checked that learners never decide on separate values for a single instance of
consensus. This property is closely related to guarded pomsets. We would like to use
CNetKAT to show correctness of the P4 implementation of the protocol directly 
(translation from the P4 code is almost direct, see Figure 6 for an example). 
The reader will notice that the CNetKAT expression in Figure 6 uses an action 
of the form $f \leftarrow v$, where $f$ is a field (inst) and $v$ a global variable (instance).
Adding actions of the converse form $v \leftarrow f$ is trivial since the packet logic specifies
that $f$ always has exactly one value. However, actions $f \leftarrow v$ require more care:
the value of global variables can only be determined at the end since parallel 
threads might change it while it is being copied. To accommodate this in the 
semantics, we will have to allow partially defined packet fields and determine 
the missing field values at the end (when we check for guarded traces). 
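
As a rough illustration of the leader logic in Figure 6, the following Python sketch mirrors the pseudocode step by step. It is sequential and treats the read of `instance` and its increment as one atomic step, which is exactly the simplification that the concurrent setting discussed above does not permit. The `Packet` and `Leader` classes and the `num_acceptors` parameter are hypothetical names introduced for this sketch; they are not taken from [9] or from CNetKAT.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    msgtype: str
    inst: Optional[int] = None
    rnd: Optional[int] = None
    dst: Optional[int] = None


class Leader:
    """Sequential sketch of the leader logic; `instance` plays the global variable."""

    def __init__(self, num_acceptors: int) -> None:
        self.instance = 0
        self.num_acceptors = num_acceptors

    def receive(self, pkt: Packet) -> List[Packet]:
        if pkt.msgtype != "REQUEST":
            return []  # default case: drop the packet
        # Copy the global variable into the packet field (the action inst <- instance).
        out = replace(pkt, msgtype="PHASE2A", rnd=0, inst=self.instance)
        self.instance += 1
        # Multicast: one copy per acceptor, numbered 1..k.
        return [replace(out, dst=a) for a in range(1, self.num_acceptors + 1)]


leader = Leader(num_acceptors=3)
first = leader.receive(Packet(msgtype="REQUEST"))
second = leader.receive(Packet(msgtype="REQUEST"))
print([(p.inst, p.dst) for p in first])
print([(p.inst, p.dst) for p in second])
```

In this sequential sketch, consecutive requests receive consecutive instance numbers. If two requests were processed in parallel, both could copy the same value of `instance` before either increment takes effect, which is why the copy from a global variable into a field needs the extra care described above.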
Another exciting direction for future work is the development of a library of 
litmus tests for networking in the spirit of [1]. Litmus tests are carefully crafted 
concurrent programs operating on shared memory locations that expose subtle 
bugs in memory models of hardware. One could imagine using the guarded pom- 
sets semantics to discover minimal witnesses of undesired concurrent behavior. 
We would also like to investigate the memory model of CNetKAT; this would 
give insight into the rules followed by operations on the global state. For a partial 
answer, we can look at POCKA. The guarded fragment of the POCKA semantics 
was shown to be sequentially consistent (concurrent memory accesses behave as 
if they are executed sequentially [24]), as it passed the store buffering litmus 
test [1]. The guarded fragment of the pomsets recording global variable changes 
is expected to pass this litmus test as well. It is worth investigating whether 
CNetKAT also supports other weak memory models, such as linearizability. 
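
For readers unfamiliar with the store buffering litmus test, the following Python sketch enumerates all sequentially consistent executions of its standard formulation: one thread stores to `x` and then loads `y`, the other stores to `y` and then loads `x`. The sketch is independent of POCKA and CNetKAT; the names `THREADS` and `sc_outcomes` are hypothetical, and the encoding of operations is chosen only for illustration.

```python
from itertools import permutations
from typing import Set, Tuple

# Store buffering: thread 0 runs x=1; r0=y and thread 1 runs y=1; r1=x.
THREADS = {
    0: [("store", "x", 1), ("load", "y", "r0")],
    1: [("store", "y", 1), ("load", "x", "r1")],
}


def sc_outcomes() -> Set[Tuple[int, int]]:
    """Collect (r0, r1) over all interleavings that respect program order."""
    ops = [(t, i) for t, prog in THREADS.items() for i in range(len(prog))]
    results = set()
    for order in permutations(ops):
        # Respect program order within each thread.
        if any(order.index((t, 0)) > order.index((t, 1)) for t in THREADS):
            continue
        mem = {"x": 0, "y": 0}  # both locations start at 0
        regs = {}
        for t, i in order:
            kind, loc, arg = THREADS[t][i]
            if kind == "store":
                mem[loc] = arg
            else:
                regs[arg] = mem[loc]
        results.add((regs["r0"], regs["r1"]))
    return results


print(sorted(sc_outcomes()))
print("(0, 0) observable under sequential consistency:", (0, 0) in sc_outcomes())
```

Under sequential consistency the outcome where both loads return 0 never occurs, because at least one store precedes both loads in every interleaving. A semantics passes the test in this sense when it rules out that outcome.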